Cfrln: Incidents

6000 Harvard applicants’ personal data on Bittorrent

Harvard just learned security investigation 101 the hard way.

Harvard admitted yesterday that a web server containing financial application data for over 10,000 applicants was hacked a month ago. They learned of the incident on February 15 and took the server down until February 21 to investigate and implement stronger security controls. Their announcement reveals how slow and ineffective security investigations often are.

“The University’s initial examination did not reveal the full extent of the hack. As the investigation continued, it became apparent that some sensitive applicant data, including Social Security numbers, could potentially have been accessed.”

Unfortunately, a day later, it was pretty obvious that over 6,000 applicants’ data had been compromised - CNet reports that all their personal data was on Bittorrent.

“Harvard officials said the data includes the applicant’s name, Social Security number, date of birth, address, e-mail address, phone numbers, test scores, previous school attended, and school records.”

Ouch.

It shouldn’t have taken Harvard nearly a month to come up with an answer as weak as “could potentially have been accessed.”

Facebook, privacy and IT data

Facebook is getting a lot of flak in the press (most recently in the Register) over a gossip blog's reports of some pretty serious privacy holes:

1. anyone who works there can look at anyone’s private profile

2. anyone who works there can look at logs of what other profiles any user has seen.

If Facebook wants to turn their act around, or any other social networking site wants to avoid being in their position, they’d better pay attention to some best practices around securing and reviewing IT data.

Here’s what best practice would say about Facebook’s two problems.

The first problem - any employee can look at any customer’s data - is classically the kind of thing that has brought on regulations in other industries, such as PCI-DSS, which was introduced by VISA and the other major card brands to ensure that merchants processing credit cards keep consumer financial info private. Like credit cards, a lot of the information people post to their private profiles is a goldmine for identity thieves - Information Week made this argument about Facebook even before the latest flap. If I know your birthdate and mother’s name I’m a lot further along in social engineering an unwitting customer support rep into believing I’m you. And yes, identity thieves do have insiders - ask Ford Motor Credit.

Complexity and failures in the NYT

I’ve been posting occasionally when there’s some huge meltdown of a big service, like the two recent Blackberry outages. My point is usually that the systems are so complex that failure modes are unpredictable and hard to track down - hence the sputtering of PR people days after big outages while sysadmins frantically dig through logs, configs and system metrics all over the place.

Anyway, looks like the NYT picked up on the same idea. Good article citing recent outages at United and Skype and tying them into the larger problem of increasing system complexity.

It quotes Andreas Antonopoulos, who’s been one of the analysts to really understand why IT Search is necessary in the face of increasing chaos and change in the datacenter. Here’s a video clip hosted on splunk.com of him talking about this.

The logs behind the Fox Fark hack

Valleywag (the Silicon Valley gossip site, recently upgraded when well-known tech business reporter Owen Thomas took over as the Valleywag) posted a detailed, log-event-by-log-event account of the investigation by Drew Curtis, Fark’s founder, who figured out that a would-be hacker was a Fox News reporter.

The basic correlation technique is one I first heard of several years ago from an online banking hosting company’s security team: you figure out that the same IP address is logging into multiple accounts and probably controls all of them. The specifics are a little different but the problem is essentially the same.

The trick is that email or web server logs have the IP address that hit you, along with session IDs or timestamps, and you need to correlate those to other app logs that have the user accounts.

In the Fark case this correlation showed that the account responsible for the bad action belonged to the same person as an account that was identifiably that of the Fox News guy.

In the online banking case it was a way to detect phishing rings - if one Ukrainian Internet cafe’s IP hits 10 accounts at an American regional bank in an hour… probably not legit.
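The correlation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log records, field layout, IP addresses, account names and the function name `accounts_by_ip` are all hypothetical, and real web and application logs would need parsing first. It joins web-server hits to accounts through the session ID, then flags any IP that touched several distinct accounts within an hour.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical web-server log records: (timestamp, client IP, session ID).
web_log = [
    ("2007-08-01 10:02:11", "203.0.113.7", "s1"),
    ("2007-08-01 10:15:40", "203.0.113.7", "s2"),
    ("2007-08-01 10:31:05", "198.51.100.4", "s3"),
    ("2007-08-01 10:48:59", "203.0.113.7", "s4"),
]

# Hypothetical application log records: (session ID, user account).
app_log = [
    ("s1", "alice"),
    ("s2", "bob"),
    ("s3", "carol"),
    ("s4", "dave"),
]


def accounts_by_ip(web_log, app_log, window=timedelta(hours=1), threshold=2):
    """Return {ip: sorted accounts} for every IP that touched at least
    `threshold` distinct accounts within a single `window` of time."""
    session_to_account = dict(app_log)

    # Join the two logs on session ID and group the hits by client IP.
    hits = defaultdict(list)
    for ts, ip, session in web_log:
        account = session_to_account.get(session)
        if account is not None:
            when = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
            hits[ip].append((when, account))

    suspicious = {}
    for ip, events in hits.items():
        events.sort()
        # Slide a window forward from each event and count distinct accounts.
        for i, (start, _) in enumerate(events):
            in_window = {acct for t, acct in events[i:] if t - start <= window}
            if len(in_window) >= threshold:
                suspicious[ip] = sorted(in_window)
                break
    return suspicious


result = accounts_by_ip(web_log, app_log)
print(result)
```

With a higher `threshold` (say 10, as in the Internet cafe example) the same function would only report IPs that hit many accounts in the window.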

$1 billion market cap loss due to service problems. Ouch.

This one’s even worse than taking eBay’s market cap down $1 billion yesterday.
Why do outages last this long? Because it’s too hard to find out where the problem happened.

At 10 p.m. last night, about a full day after the problem started and while rumours flew around that they’d been hacked, Skype finally posted that the issue was a problem in their networking code. I bet it took Skype that full day to find that the problem was with the networking code. Why? Because if Skype is anything like any other big IT operation I talk to, dozens of admins spent the day writing and running slow one-off scripts and testing various hypotheses against log data, configurations, code, scripts and the like scattered around the thousands of servers that would be behind a service of this scale.

If you work at Skype and I’m wrong about that, please let me know. But I might not believe you.

So how should IT shops avoid this?

  • Capture all logs, configuration changes, script changes and source code revisions in real time in a central place
  • Index it all so you can search it fast
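As a toy illustration of those two steps, here is a minimal sketch of a central store with an inverted index, written in Python. It is not a description of how Splunk or any other product works internally; the class name `LogIndex`, the source labels and the sample events are all hypothetical. The point is only that once every event is tokenized on arrival, a search becomes a set intersection rather than a grep across thousands of servers.

```python
import re
from collections import defaultdict


class LogIndex:
    """Toy central store: keeps every event and an inverted index of its tokens."""

    def __init__(self):
        self.events = []                  # event ID -> (source, line)
        self.postings = defaultdict(set)  # token -> set of event IDs

    def add(self, source, line):
        """Capture one event (log line, config change, etc.) and index it."""
        event_id = len(self.events)
        self.events.append((source, line))
        for token in set(re.findall(r"[a-z0-9_]+", line.lower())):
            self.postings[token].add(event_id)
        return event_id

    def search(self, *terms):
        """Return the events containing all of the given terms, in arrival order."""
        ids = None
        for term in terms:
            matches = self.postings.get(term.lower(), set())
            ids = matches if ids is None else ids & matches
        return [self.events[i] for i in sorted(ids or ())]


# Hypothetical events from different servers and file types.
index = LogIndex()
index.add("web01:/var/log/app.log", "ERROR connection timeout to db01")
index.add("db01:/etc/my.cnf", "max_connections changed from 500 to 50")
index.add("web02:/var/log/app.log", "ERROR connection refused by db01")

errors = index.search("error", "db01")
config_changes = index.search("max_connections")
print(errors)
print(config_changes)
```

Because logs and configuration changes sit in the same index, the search for errors and the search for the change that may have caused them are both a single lookup.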

Wireless meltdowns Thursday - shoulda Splunked it!

Nearly everyone at Splunk fell victim to a series of wireless meltdowns yesterday evening - across three different carriers. Cingular was down for 4 hours in the San Francisco Bay Area due to a “software glitch.” Verizon and T-Mobile Blackberries were delivering email 6-12 hours late.

(The local CBS station picked up on Cingular’s outage. In the humor department, their ad server was showing a Cingular Wireless ad below the story when I looked this morning.)

This is *exactly* the reason smart operations and development teams are picking up Splunk. Why does a software glitch leave a major wireless carrier offline for 4 hours? It’s a guess, but a pretty safe one, that there were sweating sysadmins copying and emailing logfiles and configurations and running diagnostic commands on hundreds of servers while impatient developers who could actually debug things waited for the data to trickle in.

I bet those developers would have found the problem a lot faster if they had real-time search access to all of the production data.

Anyone with information that’s more than a guess about this? Would love to hear from you in the comments.