
Tell us your Splunk story at Interop

Posted:  April 16th, 2008
Tags:  Splunk

Are you planning on being at Interop in Vegas April 27-May 2? Do you use Splunk? If so, I’d love to hear from you.

I’ll be there with the Splunk video team and we’d love to record some new interviews with Splunk users. If you haven’t seen some of the user interview videos we’ve already done, check them out. They’re the best way to learn about how Splunk’s getting applied in the real world.

Some of my favorites: Demetri Mouratis of Rhythm New Media, using Splunk as an IT data platform across business and operations teams; Allen Hecker and Mark Bronniman, the senior security analyst and senior Unix admin at Weill Cornell Medical College, and Trevis Edgworth of Epsilon Data Management, using Splunk for network security, compliance, insider threat and network operations.

Just email me at cfrln@splunk.com and let me know when you’re available. We’ll make it fast, and there’ll be a fine Splunk jacket on its way to you when you’re done.

Permalink   |   No Comments

P-Camp preso on automating product management with Jira

Posted:  March 17th, 2008
Tags:  Splunk

Here’s the presentation that I gave this past Saturday at P-Camp, the unconference for product managers. If you’ve been following what we’re doing here with automating product management using Jira, there’s detail and screenshots in this presentation that might be interesting.



Permalink   |   1 Comment

6000 Harvard applicants’ personal data on Bittorrent

Posted:  March 13th, 2008
Tags:  Splunk

Harvard just learned security investigation 101 the hard way.

Harvard admitted yesterday that a web server containing financial application data for over 10,000 applicants was hacked a month ago. They learned of the incident on February 15 and took the server down until February 21 to investigate and implement stronger security controls. Their announcement reveals how slow and ineffective security investigations often are.

“The University’s initial examination did not reveal the full extent of the hack. As the investigation continued, it became apparent that some sensitive applicant data, including Social Security numbers, could potentially have been accessed.”

Unfortunately, a day later, it was pretty obvious that over 6,000 applicants’ data had been compromised - CNet reports that all their personal data was on Bittorrent.

“Harvard officials said the data includes the applicant’s name, Social Security number, date of birth, address, e-mail address, phone numbers, test scores, previous school attended, and school records.”

Ouch.

It shouldn’t have taken Harvard nearly a month to come up with an answer as weak as “could potentially have been accessed.”

Why couldn’t they figure out for sure whether the data was accessed? Either they weren’t logging file accesses, didn’t have the logs, or the logs were too hard to analyze. Most likely a combination of all three.

Maybe they could learn from Splunk customer Weill Cornell Medical College - here’s a video of Mark Bronniman, the senior Unix administrator there, and Alan Hecker, their senior security engineer, talking about using Splunk to accelerate security investigations. In fact, they first implemented Splunk to speed up an investigation that was already in progress.

Permalink   |   No Comments

Product management nirvana

Posted:  January 20th, 2008
Tags:  Splunk

A few months ago I wrote about our effort to automate and open up product planning by implementing a process around distilling product inputs into requirements using Jira in support of an agile/scrum based development model. I’ve rarely had so much response to a post… dozens of product managers at companies large and small wrote me and commented about their own efforts along the same lines. Many asked for our specs on our Jira customizations.

We were at the beginning of this effort when I wrote that post. In the intervening 3+ months we’ve completed the first round of Jira customizations (thanks to lots of help from Dave Pickering and the team at New Aspects of Software, a fantastic consulting firm specializing in Jira - these guys do what they say they’ll do, when they say they’ll do it, for the amount of money they said they’d charge.) My tireless PM teammates have been embracing the new system and putting in the late nights to coalesce all of the feedback into common problem statements and requirements.

The work all came together for us this past week as we head into the next round of product planning and are reforming our scrums and confirming business priorities - we had a hefty but complete “PRD” that was automatically generated from Jira and represented a comprehensive view of product requirements and concepts. The PM team took about 4 hours to walk through it, confirm our initial priority cut, and then we had an incredibly productive series of sessions with the full business and product leadership to decide which problems to tackle next, how to reform the scrum teams, and what priorities to give each scrum to start with. This was the most ordered and efficient product priority setting exercise I’ve ever been through, because we were dealing with the complete picture.

The PRD report was a custom report built for us by New Aspects. It lets you filter our custom “problem statement” issue type by priority, text matches and other fields; then it prints all problems in reverse priority order. An example of a problem statement would be something like “Splunk doesn’t support the fibberziggy filesystem”, with all ERs asking for fibberziggy filesystem support linked to that problem statement. Each problem shows other linked issues including:

  • Inputs: Enhancement requests, Market data points, Call reports (each with details like customer name, deal value, etc.)
  • Requirements - with details of status, so we could understand where we were on problems that were already partially addressed by past development
  • Features
  • Child problem statements (they cascade)

Now the scrum teams are off to do their individual sprint planning and requirements development, with all of that to be captured in Jira as they go. The common “PRD” will stay up to date as they work independently, and best of all, our SEs can see all the way from their individual customer enhancement requests through to up-to-the-minute status of requirements definition and completion. Hardly the case with the old document based PRDs PMs used to create.

Next up we’re going to tackle automating cascading updates based on requirements status updates. For example, if QA validates a completed requirement that Splunk lock test succeeds on the fibberziggy filesystem, we want Jira’s workflow to check that this was the last requirement for the problem “Splunk doesn’t run on fibberziggy filesystems”, close that problem, then check to see if that problem was the last problem for each linked enhancement request, and close those enhancement requests and update our Sugar CRM system via our email integration. We even want interim updates, such as flowing back when we’ve fully scoped requirements for a problem.
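
To make the cascade concrete, here is a minimal Python sketch of the logic we have in mind. It is not Jira workflow code: the Issue class, the link and close helpers, and the notify callback (standing in for the email update to the CRM) are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Issue:
    """Hypothetical stand-in for a tracker issue (not a Jira API object)."""
    key: str
    kind: str                      # "requirement", "problem" or "enhancement_request"
    status: str = "open"
    children: list = field(default_factory=list)  # issues that must close first
    parents: list = field(default_factory=list)   # issues waiting on this one


def link(parent, child):
    """Record that `parent` cannot close until `child` is closed."""
    parent.children.append(child)
    child.parents.append(parent)


def close(issue, notify):
    """Close `issue`, then close every parent whose children are now all closed."""
    if issue.status == "closed":
        return
    issue.status = "closed"
    notify(issue)  # assumed hook, e.g. send the CRM update email
    for parent in issue.parents:
        if all(child.status == "closed" for child in parent.children):
            close(parent, notify)
```

With an enhancement request linked to one problem, and that problem linked to two requirements, closing the first requirement changes nothing else; closing the second closes the problem and then the enhancement request, calling notify once for each.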

We’re also talking to New Aspects about packaging up all our custom Jira reports, workflows and security schemes and giving it to the community, so look here for a post when it’s ready for download.

Permalink   |   2 Comments

Facebook, privacy and IT data

Posted:  October 29th, 2007
Tags:  Splunk

Facebook is getting a lot of flak in the press (latest in the Register) about reports on a gossip blog about some pretty serious privacy holes:

1. anyone that works there can look at anyone’s private profile

2. anyone who works there can look at logs of what other profiles any user has seen.

If Facebook wants to turn their act around, or any other social networking site wants to avoid being in their position, they’d better pay attention to some best practices around securing and reviewing IT data.

Here’s what best practice would say about Facebook’s two problems.

The first problem - anyone can look at any customer’s data - is classically the kind of thing that has brought on regulations in other industries, such as PCI-DSS, which was introduced by VISA to ensure that merchants processing credit cards keep consumer financial info private. Like credit cards, a lot of the information people post to their private profiles is a goldmine for identity thieves - Information Week made this argument about Facebook even before the latest flap. If I know your birthdate and mother’s name I’m a lot further along in social engineering an unwitting customer support rep into believing I’m you. And yes, identity thieves do have insiders - ask Ford Motor Credit.

A major measure that organizations following best practices for privacy are supposed to take is to lock down this private information to only insiders with a need-to-know - obviously Facebook’s not doing that. But once they do put the right access controls in place, they’re going to need to put in a review procedure to watch privileged employees. Facebook’s security or privacy staff should be reviewing logs of who has accessed private info and ensuring that there was a valid business reason for each access. The review should include:

  • logs generated by Facebook’s application itself to see employees with admin access coming in the front door
  • audit tables for the back end databases to be sure that the database admins who manage the database back-end aren’t bypassing the application’s permissions and doing manual queries to see what they shouldn’t
  • filesystem audit logs, to be sure that server or storage admins aren’t bypassing both the database and the app to look at the data on the filesystem itself
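
As a rough illustration of that review, here is a small Python sketch. It assumes the three log sources have already been parsed into a common record, and that approved business reasons are available as (employee, profile) pairs; the Access record, its field names and the unjustified function are all hypothetical, not anything Facebook or Splunk actually provides.

```python
from collections import namedtuple

# Hypothetical normalized record; `source` is "app", "database" or "filesystem".
Access = namedtuple("Access", "source employee profile_id")


def unjustified(accesses, approved):
    """Return every access that has no matching (employee, profile_id) approval.

    `approved` is an assumed set of (employee, profile_id) pairs, for example
    built from support tickets that document a valid business reason.
    """
    return [a for a in accesses if (a.employee, a.profile_id) not in approved]
```

Anything this returns is what the security or privacy staff would follow up on, whichever of the three layers the access came through.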

The second problem - that any employee can look at logs of what users have done - is a less well understood privacy issue. It’s probably particularly bad on a social networking site - do you really want your ex knowing you’re watching their profile? But you may not want every Amazon employee being able to see what items you’re browsing, so it’s an issue that affects almost any site to some degree.

To address the second issue, logs themselves need to be securely captured into a system that provides appropriate access controls to the logs themselves as well as an audit trail of who’s looked at the logs - which the security team should be reviewing proactively. Unfortunately, access logs are hardly ever considered to have privacy implications inside large sites, as evidenced by last year’s infamous publication of AOL search records.

Keeping these logs around that show who looked at what is going to be important too - law enforcement could subpoena Facebook for logs if unauthorized access by their employees is suspected to be a part of a criminal act. Facebook won’t want to be in a position where they can’t produce the logs.

The biggest reason Facebook should take this seriously? An overzealous plaintiff’s attorney somewhere is probably salivating over all the cash they raked in from Microsoft and figuring out how to sue Facebook for cash damages if a Facebook privacy breach leads to financial losses or serious personal harm, using the argument that by not following the same standard as other sites they’ve not met their “duty of care.” Think they can’t do it? TJ Maxx is getting sued right now on similar grounds.

Permalink   |   5 Comments

Splunk as job qualification

Posted:  October 5th, 2007
Tags:  Splunk

This is a fun trend for us here at Splunk - more and more job descriptions are listing Splunking skills as a plus. Really rewarding for those of us who’ve been here since before the 2005 beta!
Here are a few jobs that want you to know your Splunk:

Got any more? Post ‘em in the comments!

Permalink   |   No Comments

Automating and opening up product planning

Posted:  September 15th, 2007
Tags:  Splunk

The PM and engineering teams have embarked on an interesting experiment here at Splunk. While we’ve always leveraged the support case system to track enhancement requests and automate some of the input end of the product management process, the real meat of product definition has happened pretty much as it does anywhere - via product requirements documents (PRDs) written by PMs and answered by a variety of technical specifications, bugs and tasks in the engineering tracking system, emails, whiteboard sessions, etc.

OK, it’s Splunk, so the PRDs and tech specs have always been on the corporate wiki so there’s some measure of collaboration. Anyone in the company could go up there and have a look at what was in progress. But it’s been pretty difficult to keep PRDs and specs fully up to date while we’ve been innovating as quickly as we have since the initial launch of the product in 2005. And it’s been impossible to give our customers and field sales engineering teams the level of transparency we want in order to get their full involvement.

Our public roadmap has to be created manually and is of necessity fairly high level and updated only every month or so. The other PMs and I are constantly fielding a barrage of “what’s the status of this feature?” questions.

Now that engineering is moving to a scrum-based model (read what my boss has to say about that) in order to deliver functionality quicker and more incrementally, the whole notion of a PRD is obsolete. But that doesn’t mean that product management is obsolete - in fact a rational process of analyzing inputs, setting priorities and communicating about new feature capabilities is more important than ever.

So the experiment: We’re hacking Jira, our bug tracking system, in order to automate the entire product planning and marketing process and facilitate real-time communication back to customers, internal stakeholders and even the community at large via our public roadmap.

We’re leveraging Jira’s capabilities to create custom issues and workflows in order to reproduce the essentials of pragmatic marketing’s “requirements that work” framework, the bible on effective product management. (I wish I could link to their picture but unfortunately they are so busy selling seminars the information is under lock and key.)

This means that we’re setting it up to automatically bring enhancement requests from our SugarCRM system into a PM work queue within Jira; asking PMs to enter call reports and market datapoints; linking all of these to problem statements; and generating granular engineering requirements from these problem statements. These requirements then get triaged by the cross-functional scrum teams into sprints to deliver small, complete units of functionality quickly. Features are entered once the requirements come into enough focus to describe complete pieces of functionality and their benefit to customers.

Beyond “requirements that work” planning, we’re going to be driving a lot of the outbound communication off this system as well. For example, when a feature’s last critical requirements are completed, we’ll be automatically opening a task for a product manager to create a demo for the feature, another to update the datasheet, etc.

What’s most exciting is that once the system is tuned and we know it’s producing accurate information, we’re going to be able to give customers and the community real-time status, with the ability to give input right in the middle of the design process. Customers with enhancement requests tracked through the support portal will be able to see how they’ve been triaged, how the problem has been interpreted, and what requirements are at what stage of delivery to meet the request.

The public roadmap will be maintained in real time, with the potential for drilldown into more of what’s behind each listed feature.

We’re not the only ones trying to marry agile/scrum with pragmatic marketing. FeaturePlan is a great dedicated product for product managers that does just that. We looked at it and like it, but unfortunately the software version is currently Windows-centric while our internal corporate infrastructure is pretty Linux-centric, and we’re too oriented around running our own systems to use their hosted version. (Here’s a good presentation by Jason Tanner that describes using FeaturePlan in a similar way.) But I think that the level to which we’re trying to open things up to customers and the community is new ground.

My intent is to post here as we progress with this experiment as a way of tracking our progress and forcing myself to think through some of the challenges.

If you’re trying to do something similar at your company, I’d love to hear from you. I’m happy to share some of our process flows and schemas for Jira as well. Just drop me a line at cfrln@splunk.com.

Permalink   |   17 Comments

Complexity and failures in the NYT

Posted:  September 15th, 2007
Tags:  Splunk

I’ve been posting occasionally when there’s some huge meltdown of a big service like the two recent Blackberry outages. My point is usually that the systems are too complex so the failure mode is usually unpredictable and hard to track down - hence the sputtering of PR people days after big outages while sysadmins are frantically digging through logs, configs and system metrics all over the place.

Anyway, looks like the NYT picked up on the same idea. Good article citing recent outages at United and Skype and tying them into the larger problem of increasing system complexity.

It quotes Andreas Antonopoulos, who’s been one of the analysts to really understand why IT Search is necessary in the face of increasing chaos and change in the datacenter. Here’s a video clip hosted on splunk.com of him talking about this.

Permalink   |   No Comments

The logs behind the Fox Fark hack

Posted:  August 23rd, 2007
Tags:  Splunk

Valleywag (the Silicon Valley gossip site, recently upgraded when well-known tech business reporter Owen Thomas became the valleywag) posted a detailed, log-event-by-log-event account of the investigation by Drew Curtis, Fark’s founder, who figured out that a would-be hacker was a Fox News reporter.

The basic correlation technique is one I first heard of several years ago from an online banking hosting company’s security team - basically you figure out that the same IP address is logging into multiple accounts and probably controls both of them. The specifics are a little different but the problem is basically the same.

The trick is that email or web server logs have the IP address that hit you, with session IDs or timestamps you need to correlate to other app logs that have the user accounts.

In the Fark case this correlation showed that the account responsible for the bad action belonged to the same person as an account that was identifiably that of the Fox News guy.

In the online banking case it was a way to detect phishing rings - if one Ukrainian Internet cafe’s IP hits 10 accounts at an American regional bank in an hour… probably not legit.

The online banking guys turned this logic into a proactive alerting rule, which became a key competitive weapon for them. In the Fark case it was more about the after the fact investigation.
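
Here’s roughly what such an alerting rule could look like in Python. This is a sketch under assumptions, not the banking team’s actual rule: it assumes the web and app logs have already been joined into (timestamp, ip, account) tuples, and the function name, the threshold of 10 accounts and the one-hour window are just illustrative defaults taken from the example above.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def suspicious_ips(logins, threshold=10, window=timedelta(hours=1)):
    """Flag IPs that log into `threshold` or more distinct accounts within `window`.

    `logins` is an iterable of (timestamp, ip, account) tuples. Returns a dict
    mapping each flagged IP to the set of accounts seen in the offending window.
    """
    by_ip = defaultdict(list)
    for ts, ip, account in logins:
        by_ip[ip].append((ts, account))

    flagged = {}
    for ip, events in by_ip.items():
        events.sort()  # oldest first
        start = 0
        for end in range(len(events)):
            # Slide the window start forward until it spans at most `window`.
            while events[end][0] - events[start][0] > window:
                start += 1
            accounts = {acct for _, acct in events[start:end + 1]}
            if len(accounts) >= threshold:
                flagged[ip] = accounts
                break
    return flagged
```

Run after the fact over a day of logs it is an investigation aid; run continuously over recent events it becomes the proactive alert.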

Anyway, the Valleywag story is interesting for anyone into security log analysis.

But there’s a telling quote from Drew: “I am still collecting three or four sets of different logs together into one cohesive set.”

You could have all those logs in one place already and have clicked your way through all the links in that chain, Drew! (hint: download Splunk!)

Permalink   |   1 Comment

$1 billion market cap loss due to service problems. Ouch.

Posted:  August 17th, 2007
Tags:  Splunk

This one’s even worse than taking eBay’s market cap down $1 billion yesterday.
Why do outages last this long? Because it’s too hard to find out where the problem happened.

Skype finally posted that the issue was a problem in their networking code at 10 p.m. last night, about a full day after the problem started, while rumours flew around that they’d been hacked. I bet it took Skype that full day to find that the problem was with the networking code. Why? Because if Skype is anything like any other big IT operation I talk to, dozens of admins were spending the day writing and running slow one-off scripts and testing various hypotheses against log data, configurations, code, scripts and the like scattered around the thousands of servers that would be behind a service of this scale.

If you work at Skype and I’m wrong about that, please let me know. But I might not believe you.

So how should IT shops avoid this?

  • Capture all logs, configuration changes, script changes and source code revisions in real time in a central place
  • Index it all so you can search it fast
  • Put it behind a useful web interface that makes it easy for everyone in IT to navigate the data
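
To show why the indexing step matters, here is a toy Python sketch of the first two bullets: events from many hosts land in one place and get a simple inverted index so that a search doesn’t have to rescan every file. The LogIndex class and its methods are hypothetical, and this is only an illustration of the general idea, not a description of how Splunk is built.

```python
import re
from collections import defaultdict


class LogIndex:
    """Toy central store with an inverted index (token -> event ids)."""

    def __init__(self):
        self.events = []                  # (host, source, raw_line) in arrival order
        self.postings = defaultdict(set)  # lowercase token -> ids of events containing it

    def add(self, host, source, line):
        """Store one event and index each distinct word in it."""
        event_id = len(self.events)
        self.events.append((host, source, line))
        for token in set(re.findall(r"\w+", line.lower())):
            self.postings[token].add(event_id)
        return event_id

    def search(self, query):
        """Return events containing every word in `query`, oldest first."""
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return [self.events[i] for i in sorted(ids)]
```

A search for a word like "timeout" then comes back with matching events from every host at once, instead of someone writing a one-off script per server.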

Funny, sounds a lot like Splunk, right?

Permalink   |   No Comments