Cfrln: Homepage

Tell us your Splunk story at Interop

Are you planning on being at Interop in Vegas April 27-May 2? Do you use Splunk? If so, I’d love to hear from you.

I’ll be there with the Splunk video team and we’d love to record some new interviews with Splunk users. If you haven’t seen some of the user interview videos we’ve already done, check them out. They’re the best way to learn about how Splunk’s getting applied in the real world.

Some of my favorites: Demetri Mouratis of Rhythm New Media, using Splunk as an IT data platform across business and operations teams; Allen Hecker and Mark Bronniman, the senior security analyst and senior Unix admin at Weill Cornell Medical College; and Trevis Edgworth of Epsilon Data Management, using Splunk for network security, compliance, insider threat and network operations.

Just email me at cfrln@splunk.com and let me know when you’re available. We’ll make it fast, and there’ll be a fine Splunk jacket on its way to you when you’re done.

6000 Harvard applicants’ personal data on Bittorrent

Harvard just learned security investigation 101 the hard way.

Harvard admitted yesterday that a web server containing financial application data for over 10,000 applicants was hacked a month ago. They knew about the incident on February 15 and took the server down until February 21 in order to investigate and implement stronger security controls. Their announcement reveals how slow and ineffective security investigations often are.

“The University’s initial examination did not reveal the full extent of the hack. As the investigation continued, it became apparent that some sensitive applicant data, including Social Security numbers, could potentially have been accessed.”

Unfortunately, a day later, it was pretty obvious that over 6,000 applicants’ data had been compromised - CNet reports that all their personal data was on Bittorrent.

“Harvard officials said the data includes the applicant’s name, Social Security number, date of birth, address, e-mail address, phone numbers, test scores, previous school attended, and school records.”

Ouch.

It shouldn’t have taken Harvard nearly a month to come up with an answer as weak as “could potentially have been accessed.”

Facebook, privacy and IT data

Facebook is getting a lot of flak in the press (latest in the Register) over a gossip blog’s reports of some pretty serious privacy holes:

1. anyone who works there can look at anyone’s private profile

2. anyone who works there can look at logs of what other profiles any user has seen

If Facebook wants to turn their act around, or any other social networking site wants to avoid being in their position, they’d better pay attention to some best practices around securing and reviewing IT data.

Here’s what best practice would say about Facebook’s two problems.

The first problem - anyone can look at any customer’s data - is classically the kind of thing that has brought on regulations in other industries, such as PCI-DSS, which VISA and the other major card brands introduced to ensure that merchants processing credit cards keep consumer financial info private. Like credit cards, a lot of the information people post to their private profiles is a goldmine for identity thieves - Information Week made this argument about Facebook even before the latest flap. If I know your birthdate and mother’s name I’m a lot further along in social engineering an unwitting customer support rep into believing I’m you. And yes, identity thieves do have insiders - ask Ford Motor Credit.

Splunk as job qualification

This is a fun trend for us here at Splunk - more and more job descriptions are listing Splunking skills as a plus. Really rewarding for those of us who’ve been here since before the 2005 beta!
Here are a few jobs that want you to know your Splunk:

Got any more? Post ‘em in the comments!

Automating and opening up product planning

The PM and engineering teams have embarked on an interesting experiment here at Splunk. While we’ve always leveraged the support case system to track enhancement requests and automate some of the input end of the product management process, the real meat of product definition has happened pretty much as it does anywhere - via product requirements documents (PRDs) written by PMs and answered by a variety of technical specifications, bugs and tasks in the engineering tracking system, emails, whiteboard sessions, etc.

OK, it’s Splunk, so the PRDs and tech specs have always been on the corporate wiki so there’s some measure of collaboration. Anyone in the company could go up there and have a look at what was in progress. But it’s been pretty difficult to keep PRDs and specs fully up to date while we’ve been innovating as quickly as we have since the initial launch of the product in 2005. And it’s been impossible to give our customers and field sales engineering teams the level of transparency we want in order to get their full involvement.

Our public roadmap has to be created manually and is of necessity fairly high level and updated only every month or so. The other PMs and I are constantly fielding a barrage of “what’s the status of this feature?” questions.

Complexity and failures in the NYT

I’ve been posting occasionally when there’s some huge meltdown of a big service like the two recent Blackberry outages. My point is usually that the systems are so complex that the failure mode is unpredictable and hard to track down - hence the sputtering of PR people days after big outages while sysadmins are frantically digging through logs, configs and system metrics all over the place.

Anyway, looks like the NYT picked up on the same idea. Good article citing recent outages at United and Skype and tying them into the larger problem of increasing system complexity.

It quotes Andreas Antonopoulos, who’s been one of the analysts to really understand why IT Search is necessary in the face of increasing chaos and change in the datacenter. Here’s a video clip hosted on splunk.com of him talking about this.

The logs behind the Fox Fark hack

Valleywag, the Silicon Valley gossip site that recently got an upgrade when well-known tech business reporter Owen Thomas took over as the Valleywag, posted a detailed, log event by log event account of the investigation by Drew Curtis, Fark’s founder, who figured out that a would-be hacker was a Fox News reporter.

The basic correlation technique is one I first heard of several years ago from an online banking hosting company’s security team: you figure out that the same IP address is logging into multiple accounts and probably controls all of them. The specifics are a little different but the problem is the same.

The trick is that email or web server logs have the IP address that hit you, with session IDs or timestamps you need to correlate to other app logs that have the user accounts.

In the Fark case this correlation showed that the account responsible for the bad action belonged to the same person as an account that was identifiably that of the Fox News guy.

In the online banking case it was a way to detect phishing rings - if one Ukrainian Internet cafe’s IP hits 10 accounts at an American regional bank in an hour… probably not legit.
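Here’s a minimal sketch of that correlation in Python. It is not what Fark or the bank actually ran. The log records, field layout, session IDs and account names are all made up for illustration, and real logs would need parsing first. The idea is just to join the web log (IP plus session ID) to the app log (session ID plus account) and flag any IP that touches several accounts inside a time window.

```python
from collections import defaultdict
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical, already-parsed web server log: (timestamp, client IP, session ID).
WEB_LOG = [
    ("2007-08-17 10:01:00", "203.0.113.7", "s1"),
    ("2007-08-17 10:05:00", "203.0.113.7", "s2"),
    ("2007-08-17 10:20:00", "198.51.100.4", "s3"),
    ("2007-08-17 14:00:00", "198.51.100.4", "s4"),
]

# Hypothetical app log: (session ID, account that the session logged into).
APP_LOG = [
    ("s1", "troll_account"),
    ("s2", "reporter_account"),
    ("s3", "carol"),
    ("s4", "dave"),
]


def suspicious_ips(web_log, app_log, min_accounts=2, window=timedelta(hours=1)):
    """Return {ip: [accounts]} for IPs that hit min_accounts or more
    distinct accounts within a single time window."""
    session_to_account = dict(app_log)

    # Join the two logs on session ID, grouping the results by client IP.
    events_by_ip = defaultdict(list)
    for ts, ip, session in web_log:
        account = session_to_account.get(session)
        if account is None:
            continue  # no matching app log entry for this session
        events_by_ip[ip].append((datetime.strptime(ts, FMT), account))

    flagged = {}
    for ip, events in events_by_ip.items():
        events.sort()
        best = set()
        # Try a window starting at each event and keep the largest account set.
        for start, _ in events:
            in_window = {
                acct for t, acct in events
                if timedelta(0) <= t - start <= window
            }
            if len(in_window) > len(best):
                best = in_window
        if len(best) >= min_accounts:
            flagged[ip] = sorted(best)
    return flagged


if __name__ == "__main__":
    for ip, accounts in suspicious_ips(WEB_LOG, APP_LOG).items():
        print(ip, "controls", accounts)
```

For the phishing case you would raise `min_accounts` to something like 10 and keep the one-hour window. The thresholds are a judgment call, not a rule.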

$1 billion market cap loss due to service problems. Ouch.

This one’s even worse than taking eBay’s market cap down $1 billion yesterday.
Why do outages last this long? Because it’s too hard to find out where the problem happened.

Skype finally posted that the issue was a problem in their networking code at 10 p.m. last night, about a full day after the problem started, while rumours flew around that they’d been hacked. I bet it took Skype that full day to find that the problem was with the networking code. Why? Because if Skype is anything like any other big IT operation I talk to, dozens of admins were spending the day writing and running slow one-off scripts and testing various hypotheses against log data, configurations, code, scripts and the like scattered around the thousands of servers that would be behind a service of this scale.

If you work at Skype and I’m wrong about that, please let me know. But I might not believe you.

So how should IT shops avoid this?

  • Capture all logs, configuration changes, script changes and source code revisions in real time in a central place
  • Index it all so you can search it fast
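To make those two steps concrete, here is a toy sketch in Python of "one central place plus an index". It is an illustration only and says nothing about how Splunk itself is built. The `LogIndex` class, the source names and the sample lines are all hypothetical. Every line from every source goes into one store, and a simple inverted index maps each token to the events that contain it, so a search looks events up in the index rather than grepping every server.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9_]+")


class LogIndex:
    """Toy central store with an inverted index (token -> event IDs)."""

    def __init__(self):
        self.events = []                  # event ID -> (source, raw line)
        self.postings = defaultdict(set)  # token -> set of event IDs

    def add(self, source, line):
        """Store one line from any source (log, config, script) and index it."""
        event_id = len(self.events)
        self.events.append((source, line))
        for token in set(TOKEN.findall(line.lower())):
            self.postings[token].add(event_id)
        return event_id

    def search(self, *terms):
        """Return events containing all the given terms, in arrival order."""
        if not terms:
            return []
        ids = None
        for term in terms:
            matches = self.postings.get(term.lower(), set())
            ids = matches if ids is None else ids & matches
        return [self.events[i] for i in sorted(ids)]


if __name__ == "__main__":
    index = LogIndex()
    # Hypothetical sources and lines, for illustration only.
    index.add("web01:/var/log/app.log", "ERROR connection reset in net_io")
    index.add("web02:/etc/app.conf", "changed max_connections from 100 to 10")
    index.add("db01:/var/log/db.log", "ERROR too many connections")
    for source, line in index.search("error"):
        print(source, "|", line)
```

A real deployment would also need timestamps, persistence and a way to get the data off the servers as it is written. The sketch only shows why indexed search beats one-off scripts run across thousands of machines.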

Splunk Professional Services - hire us, join us

Since Splunk is so easy to install and get started with, most people do their initial Splunk deployment on their own. Unlike a big complicated piece of operations or security bloatware, it pretty much just works.

But a lot of companies I talk to have a backlog of things they’ve been meaning to add to their deployment - maybe some tweaks to how they’re indexing their data, maybe scaling out their deployment to a lot more servers, devices or applications. I also often hear they want to train a lot of new users in their organization on Splunk so they stay off of production machines.

These companies’ admins often don’t realize they don’t have to do it all on their own. Splunk has a professional services team of experts who can come onsite and do anything from formal design of large deployments to installation, configuration and customization.

We also offer education classes, so we can train your users so you don’t have to.

Just contact sales@splunk.com and ask about services and education packages and pricing.

Also, we’re hiring. If you’re reading this and thinking that working with lots of Splunk customers to deploy Splunk in different ways sounds like a dream job, get in touch with me! (But only if you’re really really good.) cfrln@splunk.com.

Wireless meltdowns Thursday - shoulda Splunked it!

Nearly everyone at Splunk fell victim to a series of wireless meltdowns yesterday evening - across three different carriers. Cingular was down for 4 hours in the San Francisco Bay Area due to a “software glitch.” Verizon and T-Mobile Blackberries were delivering email 6-12 hours late.

(The local CBS station picked up on Cingular’s outage. In the humor department, their ad server was showing a Cingular Wireless ad below the story when I looked this morning.)

This is *exactly* the reason smart operations and development teams are picking up Splunk. Why does a software glitch leave a major wireless carrier offline for 4 hours? It’s a guess, but a pretty safe one, that there were sweating sysadmins copying and emailing logfiles and configurations and running diagnostic commands on hundreds of servers while impatient developers who could actually debug things waited for the data to trickle in.

I bet those developers would have found the problem a lot faster if they had real-time search access to all of the production data.

Anyone with information that’s more than a guess about this? Would love to hear from you in the comments.