Splunk 4’s proving *everyone* can use IT data

There’s a big reason I haven’t blogged here for a while: Splunk 4. I’ve been so wrapped up in it for the last year that I haven’t really been interested in writing about anything else. Well, now it’s out, so I’m back! So I’ll kick it off with some background on why 4 is the Splunk I’ve always wanted and a little story about how my team and I have used Splunk ourselves in a new way the past few days.

The aspect of Splunk 4 that I’m most excited about is all of the ways that it makes IT data accessible to everyone, regardless of their job.

I’ve been a data fanatic since I started my first software company job 17 years ago and worked on forecasting and order management systems. I wasn’t a developer but I was able to build out quoting and forecasting systems and do in-depth analysis using FileMaker Pro and Excel.

Since then, I’ve been involved in building out systems that let users analyze IT data in one form or another for 10 of the last 12 years: first running a tools team for MSN at Microsoft, where my team spent millions of dollars developing a log-driven executive dashboard, then at a pioneering log management vendor that moved from web analytics into SIEM, and for the last 4 years at Splunk.

I’ve seen an unimaginable variety of functions and users that need some kind of information based on logs and other machine data. The further from software development or hands-on systems administration they are, the less aware they are that the information they’re seeking is in a logfile somewhere. And even technical people who know what log it would be in may not have permission to access it.

If such an access-deprived individual is lucky, they have the power or influence to get a sysadmin to pull the data for them. If they’re not just access-deprived but technically handicapped, they also need to prevail on that sysadmin to write some scripts to massage the data into information. Then they need to trust that the sysadmin understood the business logic well enough to do the analysis right. It’s like the old story of the hungry man being handed a 3-foot long spoon.

Splunk 3 succeeded because it helped the access-deprived - which was huge in organizations hit hard with segregation of duty rules. But the non-technical user (or managers with technical chops but no time) still needed power users to run most analysis for them. Splunk 3 made it easier for technical users to fulfill the request but sysadmins still resented the distraction and savvy managers still worried about what was lost in translation.

That was as true here at Splunk as anywhere else. When we shipped 1.0, our own sysadmin kicked the tires a bit but still grepped (yes, I admit it). Somewhere around 2.x a real production setup indexing all our website server, access and error logs continuously started to get frequent usage by our web developers and sysadmins to troubleshoot problems. Yet all the time I’d sit through executive, marketing, sales, product planning and other meetings and listen to discussions where people were substituting guesses for facts - because the facts were buried in logs somewhere and our sysadmins were too busy to be burdened with one-off requests to run analysis.

As an example, I’d routinely ask Rachel, our Director of Documentation, for information about what docs topics were recently popular, trends in docs search engine referral terms, etc. as a guide to what we needed to fix in our product or processes. Sometimes I’d get the data, sometimes not, but it was always like pulling teeth. Even though Rachel and I are both technical enough to analyze a logfile the old way, we’d run into all kinds of roadblocks: switching docs platforms meant the logs stopped going to the system we were using; it was hard to set up a dashboard that we could both see; the stats we needed required analyzing more data than Splunk 3 could handle on an ad hoc basis, and we didn’t have the permissions to do any back end config; we can do regexes, but it takes too much time to swap back into that way of thinking… Ultimately we were both busy managers who would give up and go back to executing on our core jobs, without the information we really wanted. These were exactly the same stories I’d hear from Splunk 3 customers about why there were still lots of groups that could benefit from Splunk but weren’t yet doing so.

The tide turned last Friday.

In prepping for the launch I googled “Splunk 4.0” to see if people were already talking about it online. Lo and behold! Our own beta documentation, which was supposed to be locked down to beta customers, was in the Google search results. Turned out that some special pages in the docs system enabled the Google crawler to get to insecure versions of our beta docs at different URLs than what you’d get by navigating our docs the regular way. A typical example of an unknown vulnerability in a web application’s security, just like ones I hear of from our customers all the time.

As the business owner of this web app, the next thing I wanted to know was who had seen it that shouldn’t have and what they’d seen, so I’d know whether it was a big deal or not and could decide on a course of action. Too bad our daily web stats wouldn’t give me any idea of traffic that matched this very specific pattern - I’d need some custom analysis of the raw logs.

I started behaving like any hands-off manager would - I started writing email to our web producer and web developer to ask them to pull the logs and do the analysis, and I whipped up a storm with their bosses so they’d be given cycles to work on it. Then I stopped myself and logged into our live Splunk 4 instance instead.

I first searched for all referrals to the insecure URI pattern from google.com with search strings of “Splunk 4.0”. Almost nothing. Wait - that was my search and the site I used, but other crawlers could index these pages, and our pre-launch marketing used “Splunk 4” not “4.0”. So I broadened my search to all referrals to these URIs from external domains to get a raw hit count - a few thousand. If I’d just gotten the total from our web guys and hadn’t been looking at the data myself I probably would have accepted the wrong answer.

So where’d these hits come from? 4.0’s new search assistant told me a common next command was “stats”. I clicked to add it to my search and saw that one example of past usage by others on our Splunk instance was “| stats count by clientip”. OK. Click. Now it suggested “lookup” (new in 4.0). Click. Now it suggested “| lookup dnslookup clientip” - sounds promising. Click. As Splunk streamed in new client IPs to build my table on-the-fly I saw familiar names pop up in the domain names - a lot of Splunk customers, one competitor.

Now I wondered what they’d seen. I couldn’t tell from this simple statistic on the initial referred request if they’d landed on one page and left, or navigated around to lots more pages. So I found (through search assistant) examples of using stats to list URIs and added that to the stats command arguments.

I got my final result after just a few minutes - I had a table of results grouped by client IP and sorted in descending number of hits showing the first and last date they’d seen the special pages, their reverse DNS hostname, the full sequence of URIs they’d viewed, and the referring domain and search query. The timeline at the top of the search view showed me that very few hits had happened before the launch webinar invitation went out. I shared an export of the results with impacted colleagues. We decided how to react based on complete information on the impact of the vulnerability. And I didn’t waste any of our web guys’ time while they were busy getting splunk.com ready for launch.
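
For readers who want to see the shape of that final table outside Splunk, here is a minimal sketch in plain Python. It is not the search described above. It assumes Apache combined-format access logs (the post doesn't say which format the logs were in), and the function name `summarize_hits`, the `/beta/` URI prefix and the sample log lines are all made up for illustration. The reverse DNS lookup step is left out so the sketch runs without network access.

```python
import re
from collections import defaultdict
from datetime import datetime

# Apache "combined" log format. This is an assumption for illustration only.
LOG_PATTERN = re.compile(
    r'(?P<clientip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hand-written sample lines; none of this is real traffic.
SAMPLE_LINES = [
    '10.0.0.1 - - [01/Jan/2000:10:00:00 +0000] "GET /beta/install HTTP/1.1" 200 512 "http://www.google.com/search?q=splunk+4" "Mozilla/5.0"',
    '10.0.0.1 - - [02/Jan/2000:11:30:00 +0000] "GET /beta/search HTTP/1.1" 200 734 "-" "Mozilla/5.0"',
    '10.0.0.2 - - [02/Jan/2000:12:00:00 +0000] "GET /beta/install HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '10.0.0.3 - - [02/Jan/2000:12:05:00 +0000] "GET /docs/public HTTP/1.1" 200 99 "-" "Mozilla/5.0"',
]


def summarize_hits(lines, uri_prefix):
    """Group hits on URIs starting with uri_prefix by client IP.

    Returns (clientip, row) pairs sorted by descending hit count, where each
    row holds the hit count, first and last time seen, the URIs in the order
    they were requested, and the set of referrers.
    """
    table = defaultdict(
        lambda: {"count": 0, "first": None, "last": None, "uris": [], "referers": set()}
    )
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match or not match["uri"].startswith(uri_prefix):
            continue  # unparseable line, or not one of the pages of interest
        seen = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
        row = table[match["clientip"]]
        row["count"] += 1
        row["first"] = seen if row["first"] is None else min(row["first"], seen)
        row["last"] = seen if row["last"] is None else max(row["last"], seen)
        row["uris"].append(match["uri"])
        if match["referer"] not in ("", "-"):
            row["referers"].add(match["referer"])
    return sorted(table.items(), key=lambda item: item[1]["count"], reverse=True)


if __name__ == "__main__":
    for clientip, row in summarize_hits(SAMPLE_LINES, "/beta/"):
        print(clientip, row["count"], row["first"], row["last"], row["uris"])
```

The grouping step plays the same role as “stats count by clientip”; the extra fields in each row correspond to the first and last dates, URI sequence and referrer columns described above.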

But that was just the first of many uses over the next few days. Yesterday, the day of the actual launch, I was more interested in keeping watch over whether initial downloaders were having a good experience, if they were just downloading or were reading the docs, and, based on the docs usage and search terms, what features they were trying first and what features may have been giving them trouble.

Now, I’ve been asking for a dashboard with this information for a while. But, tired of asking, I just went ahead and built it. I was able to put all of this information on a new docs usage dashboard and share it with support, documentation and other colleagues - all through the UI using the new report and dashboard builders and Splunk Manager.

The dashboard helped us identify some confusion around the need to upgrade to 4.x licenses, which drove us to clarify the release notes and download page quickly. And now the whole docs team is enthusiastically using Splunk to better understand customers’ product and docs usage. They’re even planning on starting to use examples of their own usage to illustrate topics in the manuals.

I’m looking forward to seeing this tide turn for all of our customers too as others realize they can now get their own answers to all sorts of questions they used to leave unanswered.

Jira users’ group Thursday September 18

Both Dave Pickering from New Aspects and I will be at the Atlassian Jira users’ group in San Francisco next Thursday September 18, for those of you who’ve been following what we’re doing with Jira to automate product management for an agile dev organization. Looks like a lot of great Bay Area companies are going to be there.

And we really, really, are just about ready to publish the extensions and workflows we’ve done.

Details and registration.

Tell us your Splunk story at Interop

Are you planning on being at Interop in Vegas April 27-May 2? Do you use Splunk? If so, I’d love to hear from you.

I’ll be there with the Splunk video team and we’d love to record some new interviews with Splunk users. If you haven’t seen some of the user interview videos we’ve already done, check them out. They’re the best way to learn about how Splunk’s getting applied in the real world.

Some of my favorites: Demetri Mouratis, Rhythm New Media, using Splunk as an IT data platform across business and operations teams; Allen Hecker and Mark Bronniman, the senior security analyst and senior unix admin at Weill Cornell Medical College, and Trevis Edgworth of Epsilon Data Management, using Splunk for network security, compliance, insider threat and network operations.

Just email me at cfrln@splunk.com and let me know when you’re available. We’ll make it fast, and there’ll be a fine Splunk jacket on its way to you when you’re done.

P-Camp preso on automating product management with Jira

Here’s the presentation that I gave this past Saturday at P-Camp, the unconference for product managers. If you’ve been following what we’re doing here with automating product management using Jira, there’s detail and screenshots in this presentation that might be interesting.


6000 Harvard applicants’ personal data on BitTorrent

Harvard just learned security investigation 101 the hard way.

Harvard admitted yesterday that a web server containing financial application data for over 10,000 applicants was hacked a month ago. They knew about the incident on February 15 and took the server down until February 21 in order to investigate and implement stronger security controls. Their announcement reveals how slow and ineffective security investigations often are.

“The University’s initial examination did not reveal the full extent of the hack. As the investigation continued, it became apparent that some sensitive applicant data, including Social Security numbers, could potentially have been accessed.”

Unfortunately, a day later, it was pretty obvious that over 6,000 applicants’ data had been compromised - CNET reports that all their personal data was on BitTorrent.

“Harvard officials said the data includes the applicant’s name, Social Security number, date of birth, address, e-mail address, phone numbers, test scores, previous school attended, and school records.”

Ouch.

It shouldn’t have taken Harvard nearly a month to come up with an answer as weak as “could potentially have been accessed.”

Why couldn’t they figure out for sure whether the data was accessed? Either they weren’t logging file accesses, didn’t have the logs, or the logs were too hard to analyze. Most likely a combination of all three.

Maybe they could learn from Splunk customer Weill Cornell Medical College - here’s a video of Mark Bronniman, the senior Unix administrator there, and Alan Hecker, their senior security engineer, talking about using Splunk to accelerate security investigations. In fact, they implemented Splunk first to speed up an investigation that was in progress.

Product management nirvana

A few months ago I wrote about our effort to automate and open up product planning by implementing a process around distilling product inputs into requirements using Jira in support of an agile/scrum based development model. I’ve rarely had so much response to a post… dozens of product managers at companies large and small wrote me and commented about their own efforts along the same lines. Many asked for our specs on our Jira customizations.

We were at the beginning of this effort when I wrote that post. In the intervening 3+ months we’ve completed the first round of Jira customizations (thanks to lots of help from Dave Pickering and the team at New Aspects of Software, a fantastic consulting firm specializing in Jira - these guys do what they say they’ll do, when they say they’ll do it, for the amount of money they said they’d charge.) My tireless PM teammates have been embracing the new system and putting in the late nights to coalesce all of the feedback into common problem statements and requirements.

The work all came together for us this past week as we head into the next round of product planning and are reforming our scrums and confirming business priorities - we had a hefty but complete “PRD” that was automatically generated from Jira and represented a comprehensive view of product requirements and concepts. The PM team took about 4 hours to walk through it, confirm our initial priority cut, and then we had an incredibly productive series of sessions with the full business and product leadership to decide which problems to tackle next, how to reform the scrum teams, and what priorities to give each scrum to start with. This was the most ordered and efficient product priority setting exercise I’ve ever been through, because we were dealing with the complete picture.

The PRD report was a custom report built for us by New Aspects. It lets you filter our custom “problem statement” issue type by priority, text matches and other fields; then it prints all problems in reverse priority order. An example of a problem statement would be something like “Splunk doesn’t support the fibberziggy filesystem”, with all ERs asking for fibberziggy filesystem support linked to that problem statement. Each problem shows other linked issues including:

  • Inputs: Enhancement requests, Market data points, Call reports (each with details like customer name, deal value, etc.)
  • Requirements - with details of status, so we could understand where we were on problems that were already partially addressed by past development
  • Features
  • Child problem statements (they cascade)

Now the scrum teams are off to do their individual sprint planning and requirements development, with all of that to be captured in Jira as they go. The common “PRD” will stay up to date as they work independently, and best of all, our SEs can see all the way from their individual customer enhancement requests through to up-to-the-minute status of requirements definition and completion. Hardly the case with the old document based PRDs PMs used to create.

Next up we’re going to tackle automating cascading updates based on requirements status updates. For example, if QA validates a completed requirement that Splunk lock test succeeds on the fibberziggy filesystem, we want Jira’s workflow to check that this was the last requirement for the problem “Splunk doesn’t run on fibberziggy filesystems”, close that problem, then check to see if that problem was the last problem for each linked enhancement request, and close those enhancement requests and update our Sugar CRM system via our email integration. We even want interim updates, such as flowing back when we’ve fully scoped requirements for a problem.
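
The cascade described above can be sketched in a few lines of Python. This is only an illustration of the logic under stated assumptions, not Jira’s workflow API and not our eventual implementation: the `Issue` class, the `link` and `close_and_cascade` functions, the issue keys and the `notify` callback (standing in for the CRM update) are all hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Issue:
    """A hypothetical stand-in for a Jira issue of any type."""
    key: str
    status: str = "open"
    children: list = field(default_factory=list)  # issues that must close first
    parents: list = field(default_factory=list)   # issues waiting on this one


def link(parent, child):
    """Record that parent cannot close until child has closed."""
    parent.children.append(child)
    child.parents.append(parent)


def close_and_cascade(issue, notify):
    """Close issue, then close every parent whose children are now all closed."""
    issue.status = "closed"
    notify(issue)
    for parent in issue.parents:
        if parent.status != "closed" and all(c.status == "closed" for c in parent.children):
            close_and_cascade(parent, notify)


# Enhancement request -> problem statement -> two requirements.
er = Issue("ER-1")
problem = Issue("PROB-1")
req_a = Issue("REQ-1")
req_b = Issue("REQ-2")
link(er, problem)
link(problem, req_a)
link(problem, req_b)

closed_keys = []
close_and_cascade(req_a, lambda issue: closed_keys.append(issue.key))
# The problem stays open here because REQ-2 is still open.
close_and_cascade(req_b, lambda issue: closed_keys.append(issue.key))
# Closing the last requirement closes the problem, which closes the request.
```

In this sketch `notify` fires for every issue that closes; a real integration would probably filter it so that only enhancement requests trigger the CRM update.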

We’re also talking to New Aspects about packaging up all our custom Jira reports, workflows and security schemes and giving them to the community, so look here for a post when they’re ready for download.

Facebook, privacy and IT data

Facebook is getting a lot of flak in the press (latest in the Register) about reports on a gossip blog about some pretty serious privacy holes:

1. anyone that works there can look at anyone’s private profile

2. anyone who works there can look at logs of what other profiles any user has seen.

If Facebook wants to turn their act around, or any other social networking site wants to avoid being in their position, they’d better pay attention to some best practices around securing and reviewing IT data.

Here’s what best practice would say about Facebook’s two problems.

The first problem - anyone can look at any customer’s data - is classically the kind of thing that has brought on regulations in other industries, such as PCI-DSS, which was introduced by VISA to ensure that merchants processing credit cards keep consumer financial info private. Like credit cards, a lot of the information people post to their private profiles is a goldmine for identity thieves - Information Week made this argument about Facebook even before the latest flap. If I know your birthdate and mother’s name I’m a lot further along in social engineering an unwitting customer support rep into believing I’m you. And yes, identity thieves do have insiders - ask Ford Motor Credit.

A major measure that organizations following best practices for privacy are supposed to take is to lock down this private information to only insiders with a need-to-know - obviously Facebook’s not doing that. But once they do put the right access controls in place, they’re going to need to put in a review procedure to watch privileged employees. Facebook’s security or privacy staff should be reviewing logs of who has accessed private info and ensuring that there was a valid business reason for each access. The review should include:

  • logs generated by Facebook’s application itself to see employees with admin access coming in the front door
  • audit tables for the back end databases to be sure that the database admins who manage the database back-end aren’t bypassing the application’s permissions and doing manual queries to see what they shouldn’t
  • filesystem audit logs, to be sure that server or storage admins aren’t bypassing both the database and the app to look at the data on the filesystem itself
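
As a rough sketch of what such a review could look like once events from those three sources have been collected in one place, the Python below flags every access to private data that has no approved business reason attached. Everything here is a hypothetical illustration: the record fields (`source`, `employee`, `profile`, `ticket`), the sample records and the idea of tying each access to a ticket number are assumptions, not a description of Facebook’s systems.

```python
from collections import Counter

# Hand-written records standing in for the three log sources listed above.
SAMPLE_ACCESSES = [
    {"source": "app", "employee": "alice", "profile": "u100", "ticket": "CASE-1"},
    {"source": "app", "employee": "bob", "profile": "u200", "ticket": None},
    {"source": "database", "employee": "carol", "profile": "u200", "ticket": "CASE-9"},
    {"source": "filesystem", "employee": "bob", "profile": "u300"},
]

# Tickets that the (hypothetical) privacy team has approved as valid reasons.
APPROVED_TICKETS = {"CASE-1"}


def flag_unjustified(accesses, approved_tickets):
    """Return accesses whose ticket is missing or not on the approved list."""
    return [a for a in accesses if a.get("ticket") not in approved_tickets]


def review_queue(accesses, approved_tickets):
    """Rank employees by how many unjustified accesses they made."""
    flagged = flag_unjustified(accesses, approved_tickets)
    return Counter(a["employee"] for a in flagged).most_common()


if __name__ == "__main__":
    for employee, count in review_queue(SAMPLE_ACCESSES, APPROVED_TICKETS):
        print(employee, count)
```

The point of merging the sources before reviewing is that an admin who bypasses the application still shows up in the same queue.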

The second problem - that any employee can look at logs of what users have done - is a somewhat less well-understood privacy issue. It’s probably particularly bad on a social networking site - do you really want your ex knowing you’re watching their profile? But you may not want every Amazon employee being able to see what items you’re browsing, so it’s an issue that affects almost any site to some degree.

To address the second issue, logs need to be securely captured into a system that provides appropriate access controls on the logs as well as an audit trail of who’s looked at them - which the security team should be reviewing proactively. Unfortunately, access logs are hardly ever considered to have privacy implications inside large sites, as evidenced by last year’s infamous publication of AOL search records.

Keeping these logs around that show who looked at what is going to be important too - law enforcement could subpoena Facebook for logs if unauthorized access by their employees is suspected to be a part of a criminal act. Facebook won’t want to be in a position where they can’t produce the logs.

The biggest reason Facebook should take this seriously? An overzealous plaintiff’s attorney somewhere is probably salivating over all the cash they raked in from Microsoft and figuring out how to sue Facebook for cash damages if a Facebook privacy breach leads to financial losses or serious personal harm, using the argument that by not following the same standard as other sites they’ve not met their “duty of care.” Think they can’t do it? TJ Maxx is getting sued right now on similar grounds.

Splunk as job qualification

This is a fun trend for us here at Splunk - more and more job descriptions are listing Splunking skills as a plus. Really rewarding for those of us who’ve been here since before the 2005 beta!

Here are a few jobs that want you to know your Splunk:

Got any more? Post ‘em in the comments!

Automating and opening up product planning

The PM and engineering teams have embarked on an interesting experiment here at Splunk. While we’ve always leveraged the support case system to track enhancement requests and automate some of the input end of the product management process, the real meat of product definition has happened pretty much as it does anywhere - via product requirements documents (PRDs) written by PMs and answered by a variety of technical specifications, bugs and tasks in the engineering tracking system, emails, whiteboard sessions, etc.

OK, it’s Splunk, so the PRDs and tech specs have always been on the corporate wiki so there’s some measure of collaboration. Anyone in the company could go up there and have a look at what was in progress. But it’s been pretty difficult to keep PRDs and specs fully up to date while we’ve been innovating as quickly as we have since the initial launch of the product in 2005. And it’s been impossible to give our customers and field sales engineering teams the level of transparency we want in order to get their full involvement.

Our public roadmap has to be created manually and is of necessity fairly high level and updated only every month or so. The other PMs and I are constantly fielding a barrage of “what’s the status of this feature?” questions.

Now that engineering is moving to a scrum-based model (read what my boss has to say about that) in order to deliver functionality quicker and more incrementally, the whole notion of a PRD is obsolete. But that doesn’t mean that product management is obsolete - in fact a rational process of analyzing inputs, setting priorities and communicating about new feature capabilities is more important than ever.

So the experiment: We’re hacking Jira, our bug tracking system, in order to automate the entire product planning and marketing process and facilitate real-time communication back to customers, internal stakeholders and even the community at large via our public roadmap.

We’re leveraging Jira’s capabilities to create custom issues and workflows in order to reproduce the essentials of Pragmatic Marketing’s “requirements that work” framework, the bible on effective product management. (I wish I could link to their picture but unfortunately they are so busy selling seminars that the information is under lock and key.)

This means that we’re setting it up to automatically bring enhancement requests from our SugarCRM system into a PM work queue within Jira; asking PMs to enter call reports and market datapoints; linking all of these to problem statements; and generating granular engineering requirements from these problem statements. These requirements then get triaged by the cross-functional scrum teams into sprints to deliver small, complete units of functionality quickly. Features are entered once the requirements come into enough focus to describe complete pieces of functionality and their benefit to customers.

Beyond “requirements that work” planning, we’re going to be driving a lot of the outbound communication off this system as well. For example, when a feature’s last critical requirements are completed, we’ll be automatically opening a task for a product manager to create a demo for the feature, another to update the datasheet, etc.

What’s most exciting is that once the system is tuned and we know it’s producing accurate information, we’re going to be able to give customers and the community real-time status, along with the ability to give input right in the middle of the design process. Customers with enhancement requests tracked through the support portal will be able to see how they’ve been triaged, how the problem has been interpreted, and what requirements are at what stage of delivery to meet the request.

The public roadmap will be maintained in real time, with the potential for drilldown into more of what’s behind each listed feature.

We’re not the only ones trying to marry agile/scrum with pragmatic marketing. FeaturePlan is a great dedicated product for product managers that does just that. We looked at it and like it, but unfortunately the software version is currently Windows-centric while our internal corporate infrastructure is pretty Linux-centric, and we’re too oriented around running our own systems to use their hosted version. (Here’s a good presentation by Jason Tanner that describes using FeaturePlan in a similar way.) But I think that the level to which we’re trying to open things up to customers and the community is new ground.

My intent is to post here as we progress with this experiment as a way of tracking our progress and forcing myself to think through some of the challenges.

If you’re trying to do something similar at your company, I’d love to hear from you. I’m happy to share some of our process flows and schemas for Jira as well. Just drop me a line at cfrln@splunk.com.

Complexity and failures in the NYT

I’ve been posting occasionally when there’s some huge meltdown of a big service like the two recent Blackberry outages. My point is usually that the systems are too complex so the failure mode is usually unpredictable and hard to track down - hence the sputtering of PR people days after big outages while sysadmins are frantically digging through logs, configs and system metrics all over the place.

Anyway, looks like the NYT picked up on the same idea. Good article citing recent outages at United and Skype and tying them into the larger problem of increasing system complexity.

It quotes Andreas Antonopoulos, who’s been one of the analysts to really understand why IT Search is necessary in the face of increasing chaos and change in the datacenter. Here’s a video clip hosted on splunk.com of him talking about this.
