thebaumblog: Archive for August, 2008

Man Versus Machine: Part One

Recently I gave a talk at the BT annual technology gathering. The setting was a really beautiful estate called The Grove, just north of London in Hertfordshire, England. A couple hundred of BT’s smartest technology managers were in attendance and I was supposed to think of something to hold their interest for an hour. I got to thinking about all the technology and infrastructure BT must have and how in the world they manage it. I started gathering data. With internal growth, new projects like BT’s 21st Century Network and acquisitions over the past decade through BT Global Services outsourcing contracts, the company has a lot of IT infrastructure:

  • 74 data centers,
  • 163 countries,
  • 3,000 applications,
  • 6,000 different types of systems/devices and
  • 17,000 IT staff (6,000 BT and 11,000 outsourced).

I also spent a few hours with some of BT’s brightest architects who are working on attempts to virtualize every layer of their infrastructure: network, storage, database, application, web servers, VoIP, collaboration, ordering, billing, provisioning, monitoring etc. What’s their biggest problem, I asked. Resoundingly it was “our customers are still often the ones that tell us stuff is broken.” This was so reminiscent of my time at places like Yahoo!, where we’d have these 7×24 war rooms during key outages and the daily conference calls with 30-40 people on the line, all emailing logs and configurations to each other.

As our IT infrastructures become incredibly complex, dynamic, service oriented, virtualized and mission critical, we’re confronted with a battle raging in our data centers. And it appears the machines are winning and the humans are losing.

Our biggest problem is figuring out: did something go wrong? Why? Where does truth lie? According to market researcher IDC, more than $140B was spent in 2007 managing the world’s data centers. IT OPEX is growing at 2.5 times the rate of hardware spend, and a third to a half of TCO is spent recovering from problems. The cost of availability now dwarfs the purchase and maintenance cost of technology.

So what have we as an IT industry done to address the problem?

We’ve created concepts like ITIL and CMDBs. While there are some good process improvements here for sure, these top-down modeling approaches and pre-determined rules only tell us what we already know. In my experience it is not the things we already know about that bite us in the ass and take our systems down for prolonged periods of time. It’s the multitude of unanticipated and unavoidable dependencies and interactions that take place in a complex system. And it’s impossible to know what set of dependencies and interactions will cause downtime until it occurs. Our infrastructures are just too indeterminate. That’s the point after all. Tier it, load balance it, virtualize it, so we don’t have to worry about the dependencies and interactions among all the different components. Well guess what? We do have to care. Because we have to fix it when it goes wrong.

Take the analogy of a complex air traffic control system. Sure the air traffic controllers feel really great when they arrive at work in the morning. They’ve got their coffee, flight plans and a good handle on the early morning inbound and outbound traffic.

[Image: flight plan]

Then the day gets a bit more challenging. Weather conditions over Chicago back up landings at O’Hare. A baggage handler and mechanic strike slows down JFK departures. A pilot radios he’s three degrees north over Pennsylvania but where is he really? Now you need radar. Throw the flight plans out the window. You need to know what’s actually happening now.

[Image: radar]

So how do we establish the equivalent of radar for a complex IT infrastructure? Component monitoring doesn’t work any more. If the problem is a single component failure, we already know about it. We’ve already automated the swapping in of a new machine or device. And we can reboot software components automatically. IBM has its own marketing play on this called “Autonomic Computing” but that too seems to focus only on the simple single-component issues, not the indeterminate chaos that ensues in a real running system. And it seems like more slideware than real solutions.

In my next post I’ll tackle the issue of how we might look at things differently.

Stay tuned.

Splunk Live Southwest 2008

This week we’ve been moseying through the Southwestern part of the US with our Splunk Live show. We changed up the format a bit with Splunk technical workshops in the morning and customer round tables in the afternoon. The technical workshops were a big hit with more than 200 people registered to engage with our Splunk Experts. During the workshop you were able to download, install, configure and start using Splunk on your laptop or server with remote access. The best part about Splunk Live events though is sharing ideas with other Splunk fanatics.

Ryan Peterson from Infusionsoft, a marketing automation company, gave a great talk in Scottsdale about his Splunk deployment for the company’s email infrastructure. Ryan is tasked with keeping more than 12M emails a week flowing out of the system to support Infusionsoft’s Automated Follow-up Technology (AFT). Ryan has multiple servers in different geographies in addition to PCI Compliance requirements. He demonstrated using Splunk to troubleshoot problems spread across the messaging infrastructure, address reporting inaccuracies and deliver PCI reports to auditors. He’s even indexing the content of email with Splunk using a scripted LDAP data input. Cool stuff.

In San Diego Tony Doan of the Genomics Institute at the Novartis Research Foundation (GNF) and Eric Van Johnson from Sony Consumer Electronics joined us. Tony is a security engineer and former pen tester. He also confesses to being a recovering Unix sysadmin. GNF has 600 Windows desktops and several hundred Windows and Linux servers supporting the discovery of new biological processes and improved human therapeutics. Tony discussed how they splunk Cisco CSC, Bluecoat, Symantec AV, Arpwatch, Cisco Switches and Wifi access points to find what he calls “previously unknowns” to improve operational availability and security. He says they’re finding new uses every day but Tony’s favorite is splunking Cisco IPS and Cisco MARS events looking for odd behaviors. Next up for GNF is eating Windows Event Logs and Windows Registry inputs together with summary indexing for consolidated reporting.

Eric Van Johnson is the eServices Hosting and Operations Manager at Sony Consumer Electronics. He led a great discussion on splunking IBM Websphere and MQ Series events, including how Sony has integrated operations and development environments to identify problems with complex apps more quickly and avoid unnecessary escalations to the development team. He shared with us Sony’s roll out of Splunk to their Business Intelligence Group. The idea is to complement aggregated WebMethods data reporting for business activity monitoring. Next up he wants to feed Splunk data back and forth with Verizon’s hosting operations since some of the Sony servers are hosted at Verizon and Verizon is also using Splunk.

In LA Rich Horace, Director of Systems Engineering and Operations at Fox Interactive Media, demonstrated how Fox uses Splunk in the Fox Audience Network. Basically these are the guys that serve web advertisements across all the Fox properties including MySpace, Rotten Tomatoes, Fox Sports and IGN. He’s challenged with launching new monetization platforms and keeping the existing ones running. Rich gave a fantastic overview of his Splunk installation, which consolidates/aggregates data from disparate systems in order to protect against hackers and meet PCI and SOX requirements. He currently runs an environment with ~600 Linux servers, load balancers, servers, NetApps and network switches. So far he’s indexed 1.5B events. We engaged with everyone in a lively discussion about securing production sites from developers and controlling and auditing access to data using Splunk’s access controls and search filters. Rich also discussed how Fox is using Splunk to integrate with various Citrix products including Netscaler and XenApp.

Thanks to everyone who shared their stories with us this week. It was really awesome.

Splunk Developer Camp 2008

It’s Sunday night before the start of our first ever Splunk Developer Camp. Never before have we invited developers from our community at large to participate in sharing their ideas about building Splunk Apps and learning about all the cool stuff in our upcoming releases. I think I can speak for everyone at Splunk when I say we are truly amazed with the level of interest and participation. We’ve had to move the venue three times now to accommodate the growing list of participants and while we initially expected the mix would be mostly existing customers, we’re really pleased with the mix of developers coming tomorrow.

  • 125 Developers
  • 91 Organizations
  • 26 Industries
  • 9 Countries

Only a third of the developers showing up are customers. The rest are system integrators, MSPs, OEMs, ISVs and VARs.


Post Camp Update

We organized the day into a combination of an un-conference format with developer round tables, sneak peeks of future versions of Splunk, demos, demos, demos from customers and partners and training on the Splunk API and SDKs. Our goal for the day was both to educate campers on how to effectively build Splunk apps and to get everyone jacked up about the possibilities. We broadcast the sessions live on Splunk TV.

The day started with a quick intro by me. I gave everyone a brief Splunk history lesson of the past five years and demos of the Splunk for PCI and Splunk for Server Virtualization applications. I wrapped with a discussion of our strategy to seed Splunk everywhere and to enable developers to distribute their applications to Splunk installations around the world in the near future. More on this in a future post.

Erik Swan and Rob Das, my two co-founders, followed with a more in-depth evolution-of-Splunk chat, which mainly focused on all the weird prototypes and company names we thought of before the real Splunk. Some of it is funny and some downright scary. Amazing what guys out of a job can come up with.

Konfabulator Follow Along

Next up Kord Campbell, Director of our Developer Program, gave an overview of the agenda for the day and reviewed how to register with the Konfabulator and follow along with the many demos up on our SplunkLabs EC2 server at Amazon Web Services. This worked great as everyone could build and run the demos on their own EC2 instance. Kord also showed off the new Splunk Wiki for developers and application users. We’re in the process of moving all our documentation to the wiki as a one stop shop for information on using, administering, deploying and developing for Splunk. A few other Kord matters included the review of our new Developer Program additions including a 2GB Developer Enterprise License for registered developers.

Splunk Apps

Jef Bekes, our Head Designer, and Raffy Marty, our Application Product Manager, then gave a very inspiring talk about the future of Splunk and Splunk Apps. The basic point is that in Splunk 3.3 today there is no sense of application context. This means the same default user interface for all applications, and all knowledge (saved searches, alerts, reports etc.) is shared across all installed apps. It’s also impossible to “switch” from one app to another. Splunk 4.0 attempts to address this whole problem by making applications first class objects that can be containers for collections of other objects at the interface, knowledge and configuration layers. As more and more Splunk applications arrive on the scene this encapsulation becomes increasingly important. Jef and Raffy showed a sample Splunk 4.0 Help Desk application that included custom branding, restricted task-based navigation and structured search user interfaces and results views. Other Splunk 4.0 features were reviewed too: Splunk Web gadgets, the Application Builder, improved charting and content grouping.
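
To make the encapsulation idea a bit more concrete, here is a minimal Python sketch of applications acting as containers for their own knowledge and navigation. This is only an illustration of the concept, not how Splunk 4.0 is built: the App and Installation classes, their fields and the sample searches are all hypothetical names made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class App:
    """Hypothetical container holding one application's own objects."""
    name: str
    saved_searches: dict = field(default_factory=dict)  # knowledge layer
    navigation: list = field(default_factory=list)      # interface layer


class Installation:
    """Hypothetical host that keeps apps separate and tracks the active one."""

    def __init__(self):
        self.apps = {}
        self.current = None

    def install(self, app):
        self.apps[app.name] = app

    def switch(self, name):
        # Switching changes which app's objects are in scope.
        self.current = self.apps[name]
        return self.current

    def visible_searches(self):
        # Only the active app's saved searches are visible.
        return sorted(self.current.saved_searches)


site = Installation()
site.install(App("help_desk",
                 {"open tickets": "tickets opened in the last day"},
                 ["Tickets", "Escalations"]))
site.install(App("pci",
                 {"failed logins": "authentication failures by host"},
                 ["Reports"]))

site.switch("help_desk")
help_desk_view = site.visible_searches()
site.switch("pci")
pci_view = site.visible_searches()
```

With everything in one shared pool, both apps would see both searches. Scoping each collection to its app is what lets a Help Desk user see only Help Desk objects.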

Developer Platform and API

The Splunk Developer Platform futures were up next with Tom Donahoe, Splunk Product Manager, and Johnvey Hwang, Lead UI Developer. Topics included the Splunk 4.0 improvements like Application Builder, REST API Additions, UI Extensibility and SDK Support. The Application Builder eases application creation and packaging, dramatically improving the experience beyond where Splunk 3.3 currently stands. The Application Builder will be available in both command-line and GUI forms to provide application configuration isolation and leverage file system security controls. Johnvey reviewed with us planned REST API additions for 4.0 like

  • Alerting: history, status, improved generation
  • Notifications: email, RSS
  • Search scheduling management
  • Knowledge management
  • Authentication: users, roles, single sign-on
  • Distributed: topology data, server metrics

Splunk Ninja

The Splunk Ninja (aka Michael Wilde) graced us with a visit and showed off his demo Godness with a Zero-to-Lightspeed set-up and data eating with the new Splunk Crawl feature in 3.3. Sweet!

Search Language

David Carasso, a Senior Developer, and Alex Raitz, one of our Solution Architects, did a fantastic overview of the Splunk search language and ran through some really cool examples of powerful stuff like

  • What’s the most important hard disk error on each of my hosts?
  • Who sent me the most email?
  • How long do users stay on my website?

David showed us how to create our own search commands too. Awesome stuff.
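
A question like “Who sent me the most email?” boils down to grouping events by a field and counting. Here is a rough Python sketch of that idea. It is not Splunk search syntax and not what David and Alex showed; the event dictionaries and the “sender” field name are assumptions made up for illustration.

```python
from collections import Counter


def top_senders(events, limit=3):
    """Count events per sender and return the most frequent ones first."""
    counts = Counter(event["sender"] for event in events if "sender" in event)
    return counts.most_common(limit)


# Hypothetical, already-parsed mail events.
mail_events = [
    {"sender": "alice@example.com", "size": 1200},
    {"sender": "bob@example.com", "size": 800},
    {"sender": "alice@example.com", "size": 450},
    {"size": 90},  # an event with no sender field is skipped
]

ranking = top_senders(mail_events)
```

The other questions follow the same pattern with a different grouping field and a different aggregate.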

Large Scale Reporting and Summary Indexing

Steven Sorkin, Head Indexing Geek, led a wonderful talk on large scale reporting using great examples like finding violations in security data on application layer firewalls and routers. He covered how we use map/reduce models to summarize batches of events, which is what we call summary indexing. It turns Splunk into a sort-a time slinky.
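
Here is a minimal Python sketch of the summarize-in-batches idea, assuming a simple map/reduce shape: map each event to a time bucket, reduce each bucket to per-host counts, then report over the small summary instead of the raw events. It is not Splunk’s implementation. The event tuples, the one-hour bucket and the function names are all made up for illustration.

```python
from collections import Counter, defaultdict

BUCKET_SECONDS = 3600  # summarize in one-hour batches (an assumed interval)


def summarize(events):
    """Map each (timestamp, host, action) event to its hour bucket, then
    reduce every bucket to per-host counts of denied requests."""
    summary = defaultdict(Counter)
    for timestamp, host, action in events:
        if action == "deny":
            bucket = timestamp - timestamp % BUCKET_SECONDS
            summary[bucket][host] += 1
    return summary


def violations_by_host(summary, start, end):
    """Report over the small summary instead of rescanning raw events."""
    total = Counter()
    for bucket, counts in summary.items():
        if start <= bucket < end:
            total.update(counts)
    return total


# Hypothetical firewall events: (seconds since start, host, action).
raw_events = [
    (0, "fw1", "deny"),
    (120, "fw1", "allow"),
    (1800, "fw2", "deny"),
    (3700, "fw1", "deny"),
    (7300, "fw1", "deny"),
]

summary = summarize(raw_events)
report = violations_by_host(summary, 0, 7200)
```

The payoff is that a report over a long time range only touches one small record per bucket, however many raw events went into it.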

REST/ATOM API and Splunk Gadgets