thebaumblog: Man Versus Machine

Human and Machine Language Mashups at Splunk Live Zurich, Switzerland

At Splunk Live in Zurich this week an interesting discussion erupted about human and machine languages. Before I continue with the story, I want to thank everyone that attended the event. Despite the fact that Raffy Marty is a resident celebrity, this was our first formal customer and partner event in Switzerland. We had more than 50 people attend for several hours to talk about Splunk and data center management challenges. The event was co-hosted by T-Systems.

Thank you Meno Schnapauff for your great presentation on how T-Systems and the Swiss National Railway are using Splunk!

Other attendees included folks from Swisscom, Unicom Consulting, Rothschild Bank, Genossenschaft Migros, LeShop, Netcetera, Cablecom GmbH, TBK-Patent Munich, On Line Video 46, Skyguide, PostFinance and the University of Fribourg. Brian Haynes, Tim Thorpe, Julie Duncan and Hash Basu-Choudhuri from our London office participated too.

Now part of the reason I mention all these names (in addition to thanking folks) goes to the point of this post. In the room we had an American (me), several native English speakers from different areas of England, Swiss German speakers from Switzerland and German speakers from Germany. What I noticed is how two people can think they speak the same language but can’t always understand each other. It turns out there are a lot of American (some West Coast) colloquialisms I use that my “Queen’s English” counterparts don’t understand. And of course most of the time I try to make a joke the Swiss and Germans just look at me like I’m from outer space, even though if you asked them they’d say they speak fluent English. During the event the Swiss Germans had trouble understanding the Germans and the Germans had trouble understanding the Swiss Germans. The folks from the UK who spoke German didn’t understand either the Swiss German or the German German, although they all claim to speak German.

What does all this have to do with IT you ask? Well it turns out that mashing up languages and attempting to understand each other even though we don’t speak exactly the same language is one of the biggest problems we have in trying to understand our IT systems as well.

One of the questions posed at the event was: “How can I modify my system and application logging to some standard in order to follow what my systems are doing? Do we need a logging standard?”

I have long been telling people that logging standards are a waste of time. IBM’s Common Base Events (CBE) has been around for years and has very little traction in the real world. Data Center Mark-up Language (DCML) was pushed by Opsware and lots of smart people. It got nowhere. Logs exist. Instrumentation exists. Our IT systems already have tremendous amounts of data. Trying to retrofit that data to some standard is impossible. A multi-vendor logging standard will never happen. Getting developers to log consistently sounds great but I’ve never seen it done before.

What we need is a mashup of machine languages and logging formats. That’s exactly what IT Search is!

Humans need to stop thinking about how we can format data to make it easier for machines to work with it. There is too much data. The real value is being able to work with massive amounts of data without any human intervention. This is exactly what Google does for the web. Sure you can reformat your HTML to get better search results. But even if you do nothing Google will index your site. You don’t even have to tell Google to do it!
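To make the "index it as it is" idea concrete, here is a minimal sketch in Python. It is not how Splunk or Google works internally; it is just a toy inverted index, and the three log lines, the tokenizer and the function names are all invented for illustration. The point is that the lines come from three unrelated formats and nobody had to reformat them first.

```python
import re
from collections import defaultdict

# Hypothetical log lines in three unrelated formats (invented for illustration).
RAW_EVENTS = [
    "Mar  3 10:15:02 web01 sshd[4211]: Failed password for root from 10.0.0.7 port 2201",
    '10.0.0.7 - - [03/Mar/2008:10:15:09 +0100] "GET /login HTTP/1.1" 500 312',
    "2008-03-03 10:15:11,482 ERROR OrderService - timeout calling billing host=web01",
]

# Letters, digits, underscores and dots, so IP addresses and host names stay whole.
TOKEN = re.compile(r"[A-Za-z0-9_.]+")


def tokenize(line):
    """Split a raw line into lowercase terms; no knowledge of the format needed."""
    return {t.lower().strip(".") for t in TOKEN.findall(line)} - {""}


def build_index(events):
    """Map every term to the set of event ids whose raw text contains it."""
    index = defaultdict(set)
    for event_id, line in enumerate(events):
        for term in tokenize(line):
            index[term].add(event_id)
    return index


def search(index, events, *terms):
    """Return the raw events that contain every given term (logical AND)."""
    hits = None
    for term in terms:
        ids = index.get(term.lower(), set())
        hits = ids if hits is None else hits & ids
    return [events[i] for i in sorted(hits or set())]


if __name__ == "__main__":
    index = build_index(RAW_EVENTS)
    # One IP address ties an sshd line to a web access line.
    for line in search(index, RAW_EVENTS, "10.0.0.7"):
        print(line)
```

Searching for the address pulls back both the sshd event and the web server event, and searching for the host name pulls back the sshd event and the application error, even though no common schema was ever defined.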

I’m going to start sharing more of our experiences helping people see the connections that already exist in their logging data. While the connections are not always obvious to the naked eye and human linear thinking, machines are great at teasing out non-obvious relationships. This is perhaps the most compelling thing we work on at Splunk, and we continue to push the bleeding edge of what’s possible.

Life after SIEM. Situational Awareness is next.

We’ve been hearing a lot lately about the death of SIEM technologies. But isn’t the question less about a legacy technology dying and more about the dimensions on which the next mass-adopted security capability will be born? Clayton Christensen first described a model for disruptive technology in his book The Innovator’s Dilemma and its follow-on, The Innovator’s Solution. Christensen describes a theory about how disruptive technologies overtake sustaining technologies by delivering value on new dimensions that established vendors overlook as unimportant or low end, or just don’t think about because they’re too busy improving their legacy. Christensen’s work offers an interesting framework to think about what’s taking place in the market for SIEM security management solutions.

Any enterprise trying to secure its IT infrastructure knows the state of the art in SIEM security approaches falls short. And trends like virtualization are making things even more difficult. System and security administrators and analysts are inundated with too many potential incidents, and it’s too difficult and time consuming to investigate even a fraction of them. Achieving a greater comprehension of the meaning of potential incidents and the projection of their status in the near future is the real goal. The idea, called “situational awareness”, is often, however, impossible to achieve. We are so dependent on pre-programmed rules in our SIEM solutions that we lack the ability to perform our own analysis, because the original raw data has been filtered out or thrown away, or we have no practical way to make sense of it.

Observation: If the technology is sufficiently complex as to allow the vulnerability to exist, can we really build complex technology to catch all the possible issues or scenarios?

As a reference point see David Hazekamp, Security Architect at Motorola, talk about the importance of retaining all security data across the Motorola global SOC infrastructure and integrating access to all this data into existing SIEM solutions.

Of course reaching this understanding requires one suspends their disbelief about the effectiveness of current SIEM security technologies. Usually this means you’re not a vendor or you’re a vendor with little or no vested interest in current approaches. So with this let’s examine the typical enterprise deployment of security technologies.

Defense in Depth

This is where every good enterprise security architecture starts. In order to begin securing your environment you’ve got to have data, raw data. In most data centers this takes the form of syslog from network devices and servers, SNMP traps, OPSEC or LEA interfaces for firewall events, WMI for Windows desktop and server events, IDS and IPS signature scans and application level firewall examination of common services like FTP, HTTP, SFTP, SCP etc. The thinking is you need to look at everything. Perhaps you’ll even want to pull in information from physical security systems like badge readers.

Security Information Management (SIM)

The next step in the process is to manage all this raw data and filter it down to a manageable number of events, traps and alerts. Collecting, storing and providing some basic analysis on all this data is the job of a SIM. Typically, as Raffy points out, the data is parsed, normalized and stored in a structured RDBMS. Parsing, normalizing and structuring all this data is great if the data doesn’t change or you don’t have too much of it. But if you’re dealing with data formats that aren’t static or you’re trying to store terabytes of this data an RDBMS won’t be your friend.
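A small Python sketch shows why parse-and-normalize gets brittle when formats move. Everything here is hypothetical: the regular expression, the column names and both sample lines are made up for illustration and do not come from any particular SIM product. The normalizer maps one fixed pattern onto fixed columns, the way a schema-bound store needs.

```python
import re

# Hypothetical normalizer: one fixed pattern mapped onto fixed columns
# (ts, host, user, src, port), as a relational schema would require.
SSH_FAILURE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ [\d:]{8}) (?P<host>\S+) sshd\[\d+\]: "
    r"Failed password for (?P<user>\S+) from (?P<src>\S+) port (?P<port>\d+)$"
)


def normalize(line):
    """Return a dict of columns, or None when the line does not fit the schema."""
    match = SSH_FAILURE.match(line)
    return match.groupdict() if match else None


# Invented sample lines: the second is the same kind of event with extra words.
old_format = (
    "Mar  3 10:15:02 web01 sshd[4211]: "
    "Failed password for root from 10.0.0.7 port 2201"
)
new_format = (
    "Mar  3 10:15:02 web01 sshd[4211]: "
    "Failed password for invalid user admin from 10.0.0.7 port 2201 ssh2"
)

if __name__ == "__main__":
    print(normalize(old_format))  # parsed into columns
    print(normalize(new_format))  # no longer fits the pattern
```

The second line describes the same kind of failed login, but a few extra words mean the parser returns nothing for it. Unless someone notices and updates the pattern, that event never reaches the structured store.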

Security Event Management (SEM)

Once a SIM has done its job you’re ready to aggregate, correlate and start reporting on potential incidents using a SEM to do the job. SEMs usually consist of lots of rules that look for combinations and patterns of events indicating that a possible attack or breach may be underway. Essentially the SEM rules attempt to codify what we humans know about vulnerabilities in our IT systems and possible ways to exploit them. The goal is to provide some real-time information, usually in the form of reports, dashboards and visualizations, to operations and security analysts who work to keep the infrastructure secure.
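For a feel of what one such rule looks like, here is a minimal Python sketch of a single correlation rule: several failed logins from one source inside a time window, followed by a success. The `Event` fields, the threshold of 3 and the 60 second window are assumptions chosen for the example, not taken from any real SEM product.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    ts: int        # seconds since an arbitrary start (hypothetical field)
    src: str       # source address of the login attempt
    outcome: str   # "fail" or "success"


def brute_force_then_success(events, threshold=3, window=60):
    """Flag sources with >= threshold failures inside `window` seconds
    that are then followed by a successful login."""
    recent_failures = defaultdict(deque)
    alerts = []
    for ev in sorted(events, key=lambda e: e.ts):
        fails = recent_failures[ev.src]
        # Drop failures that have aged out of the window.
        while fails and ev.ts - fails[0] > window:
            fails.popleft()
        if ev.outcome == "fail":
            fails.append(ev.ts)
        elif ev.outcome == "success" and len(fails) >= threshold:
            alerts.append((ev.src, ev.ts))
            fails.clear()
    return alerts


if __name__ == "__main__":
    sample = [
        Event(0, "10.0.0.7", "fail"),
        Event(10, "10.0.0.7", "fail"),
        Event(20, "10.0.0.7", "fail"),
        Event(30, "10.0.0.7", "success"),
        # Same pattern, but spread out so it never fills the window.
        Event(0, "10.0.0.9", "fail"),
        Event(100, "10.0.0.9", "fail"),
        Event(200, "10.0.0.9", "fail"),
        Event(210, "10.0.0.9", "success"),
    ]
    print(brute_force_then_success(sample))
```

The rule only encodes what its author already expected. In the sample data the second source does the same thing more slowly and is never flagged, so catching it depends on someone being able to go back to the raw events.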

Situational Awareness (SA)

SIEM correlation can be interesting for discovering a pattern or related event, but the ability to work an issue outside of these “canned” rules and events becomes the real problem. Unfortunately, what all too often happens is that there are so many possible attacks that operations and security staff are overwhelmed with potential incidents to investigate, and not every event or pattern of interest is going to be discovered via the pre-built rules. Situational awareness is the attempt to perceive environmental elements within a volume of space and time. Comprehension cannot be achieved if the data being bubbled up is filtered according to a set of rules and the technology does not allow a human to perform their own analysis of the raw data as generated by the environment itself. All technologies have their weaknesses and those that perform correlation are no different.

Thus whilst canned SIEM correlation provides value in bubbling things up, we still need the ability to dig into the raw data to fully perceive and comprehend what is taking place. Mind you, SA is not a new concept. It has been applied rather robustly by decision-makers in complex, dynamic areas such as aviation, air traffic control, power plant operations and military command and control, as well as in more ordinary but nevertheless complex tasks such as driving an automobile or motorcycle. And yes, it has been mentioned before in security operations, particularly in government agencies.

Man Versus Machine: Part One

Recently I gave a talk at the BT annual technology gathering. The setting was a really beautiful estate called The Grove just north of London in Hertfordshire, England. A couple hundred of BT’s smartest technology managers were in attendance and I was supposed to think of something to hold their interest for an hour. I got to thinking about all the technology and infrastructure BT must have and how in the world they manage it. I started gathering data. With internal growth, new projects like BT’s 21st Century Network, and acquisitions over the past decade through BT Global Services outsourcing contracts, the company has a lot of IT infrastructure.

  • 74 data centers,
  • 163 countries,
  • 3,000 applications,
  • 6,000 different types of systems/devices and
  • 17,000 IT staff (6,000 BT and 11,000 outsourced).

I also spent a few hours with some of BT’s brightest architects who are working on attempts to virtualize every layer of their infrastructure: network, storage, database, application, web servers, VoIP, collaboration, ordering, billing, provisioning, monitoring etc. What’s their biggest problem, I asked. Resoundingly it was “our customers are still often the ones that tell us stuff is broken.” This was so reminiscent of my time at places like Yahoo! where we’d have these 7×24 war rooms during key outages and the daily conference calls with 30-40 people on the line all emailing logs and configurations to each other.

As our IT infrastructures become incredibly complex, dynamic, service oriented, virtualized and mission critical we’re confronted with this battle raging in our data centers. And it appears the machines are winning and the humans are losing.

Our biggest problem is figuring out: did something go wrong? Why? Where does truth lie? According to market researcher IDC, more than $140B was spent in 2007 managing the world’s data centers. IT OPEX is growing at 2.5 times the rate of hardware spend, and 1/3-1/2 of TCO is spent recovering from problems. The cost of availability now dwarfs the purchase and maintenance cost of technology.

So what have we as an IT industry done to address the problem?

We’ve created concepts like ITIL and CMDBs. While there are some good process improvements here for sure, these top down modeling approaches and pre-determined rules only tell us what we already know. In my experience it is not the things we already know about that bite us in the ass and take our systems down for prolonged periods of time. It’s the multitude of unanticipated and unavoidable dependencies and interactions that take place in a complex system. And it’s impossible to know what set of dependencies and interactions will cause downtime until it occurs. Our infrastructures are just too indeterminate. That’s the point after all. Tier it, load balance it, virtualize it, so we don’t have to worry about the dependencies and interactions among all the different components. Well guess what? We do have to care. Because we have to fix it when it goes wrong.

Take the analogy of a complex air traffic control system. Sure the air traffic controllers feel really great when they arrive at work in the morning. They’ve got their coffee, flight plans and a good handle on the early morning inbound and outbound traffic.

[Image: flight plan]

Then the day gets a bit more challenging. Weather conditions over Chicago back up landings at O’Hare. A baggage handler and mechanic strike slows down JFK departures. A pilot radios he’s three degrees north over Pennsylvania but where is he really? Now you need radar. Throw the flight plans out the window. You need to know what’s actually happening now.

[Image: radar]

So how do we establish the equivalent of radar for a complex IT infrastructure? Component monitoring doesn’t work any more. If the problem is a single component failure, we already know about it. We’ve already automated the swapping in of a new machine or device. And we can reboot software components automatically. IBM has its own marketing play on this called “Autonomic Computing” but that too seems to focus only on the simple single-component issues, not the indeterminate chaos that ensues in a real running system. And it seems like more slideware than real solutions.

In my next post I’ll tackle the issue of how we might look at things differently.

Stay tuned.

Welcome!

I’m Michael Baum. Welcome to my blog.

I hope to find time to write about some of my favorite topics including:

  • Splunk and IT Search.
  • Technology gadgets and software — the stuff we all like to use.
  • Datacenter applications, servers, networks and security — the stuff we all have to keep running.
  • Business, entrepreneurship and venture capital.
  • Wall Street and investing.

Comments are always welcome and you can also reach me via email at thebaum (at) splunk (dot) com.