thebaumblog: logging

Conficker is Proof We Need to Log Broadly and Analyze Deeply

At RSA this week it’s easy to got lost in the menagerie of security technologies to conquer malware proliferation, stomp out spam and protect virtualized and cloud computing environments. But the most recent statistics show we are still losing the war on cybercrime. Symantec’s latest Internet Security Threat Report sited 1,656,227 malicious-code threats last year and 75,158 new active bot-infected computers per day. And yes the United States is still the most frequently targeted by denial-of-service attacks accounting for 51% worldwide and the top country for underground economy servers advertising stolen credit cards accounting for 67% of all activity worldwide.

Why are we losing so badly? Not surprisingly, there was a lot of talk at RSA about the Conficker worm. Some of the chatter points to reasons why the security industry is falling behind. At first glance, the Conficker worm looks harmless. So far there are not too many significant reports of infected machines and hijacked data,
but it may be too early to feel so smug about it. The worm’s real danger is its demonstrated ability to evade the expensive IDS technology enterprises have put into place and rely on today. Estimates are that 90% of the enterprise IDS implementations have failed to detect the worm’s presence and create some kind of actionable alert. How can this be?

Conficker properties are simple but different from the typical threat. First Conficker affected systems outside of IDS coverage like USB keys and mobile user laptops. So if you’re looking for attacks from outside your network only, you won’t see it. It’s a “walk-in virus”. Second it isn’t greedy like Code Red and other viruses of late. The Conficker worm has built-in sleep cycles. So where a typical worm might scan 1,000 or 10,000 IPs a minute, Conficker was happy to scan maybe say 100 and evade the baseline trip wires. Third Conficker is very selective with its payload delivery. It only delivers when it sees a vulnerability. All this helps Conficker evade IDS systems that want to witness the crime. But Conficker is the perfect crime in that it goes undetected. With no payload delivered and seemingly fewer IPs scanned there is no grossly abnormal behavior to witness. The evidence is circumstantial.

At a lunch on Wednesday, Tom Le of BT gave a good overview of how BT Managed Security Services detected Conficker for their customers. It was one of the first times I’ve really been sold on a managed security service beyond the value of cost and convenience.

First, as Tom explained it, they started by assuming IDS would miss the attack. They didn’t assume a payload had to be delivered and didn’t assume that large number of scans were needed to indicate the presence of an intruder. Instead of depending on IDS, BT uses logs and events to baseline the natural behavior of even netbios triggered scans (which Conficker happened to use) and was able to alert on small changes in scans that would be missed if you were only looking at things like netflow. As it turns out most firewalls blocked the netbios scans going out so again most customers didn’t even know they had the Conficker worm present.

Second Tom and his team assumed some type of command and control activity associated with Conficker. They followed the money watching for things like confikur trying to phone home in different ways. By having a broad set of logs and events from switches, routers, applications and IDS they were able to look for outlying behaviors like DNS lookups to obscure locations not typically seen in customer networks and aggregate this information across customers to identify common abnormalities. Tom estimates that BT sees roughly five billion messages a week across their customer base. That’s a lot of messages.

After listening to all the chatter about Conficker and walking the show floor, it gets easier to understand how criminals continue to evade the security infrastructure enterprises put in place. There are just too many ways in which breaches can occur and there is just too much data scattered about to collect and correlate in order to find the anomalies. So the security industry continues down the path of specific solutions to specific vulnerabilities and criminals continue to create new threats that evade the industry’s point approaches. I say the industry as a whole needs to move to more of an adaptable and flexible approach that can apply security to what ever threats arise, when they appear.

The best real world detectives are able to piece together seemingly circumstantial evidence and sift out the clues that lead to catching criminals. But every time it’s different. Perhaps we need to take the same approach in order to obtain more adaptable security solutions. Assume every time it’s different not the same.

Logging broadly and analyzing deeply is one of the best defenses. Without a broad swath of data you won’t have the pieces of the puzzle to put together at the moment you need to solve the crime.

Few criminals are caught in the act.

Man Versus Machine: Part One

Recently I gave a talk at the BT annual technology gathering. The setting was a really beautiful estate called The Grove just north of London in Hertfordshire England. A couple hundred of BT’s smartest technology managers were in attendance and I was supposed to think of something to hold their interest for an hour. I got to thinking about all the technology and infrastructure BT must have and how in the world do they manage it. I started gathering data. With internal growth, new projects like BT’s 21st Century Network and acquisitions over the past decade through BT Global Services outsourcing contracts the company has a lot of IT infrastructure.

  • 74 data centers,
  • 163 countries,
  • 3,000 applications,
  • 6,000 different types of systems/devices and
  • 17,000 IT staff (6,000 BT and 11,000 outsourced).

I also spent a few hours with some of BT’s brightest architects who are working on attempts to virtualize every layer of their infrastructure — network, storage, database, application, web servers, VoIP, collaboration, ordering, billing, provisioning, monitoring etc. What’s their biggest problem I asked. Resoundingly it was “our customers are still often the ones that tell us stuff is broken.” This was so reminiscent of my time at places like Yahoo! where we’d have these 7×24 war rooms during key outages and the daily conference calls with 30-40 people on the line all emailing logs and configurations to each other.

As our IT infrastructures become incredibly complex, dynamic, service oriented, virtualized and mission critical we’re confronted with this battle raging in our data centers. And it appears the machines are winning and the humans are losing.

Our biggest problem is figuring out — did something go wrong? Why? Where does truth lie? According to market researcher IDC In 2007 > $140B spent managing the world’s data centers. IT OPEX is growing at 2.5 times the rate of hardware spend and 1/3-1/2 of TCO is spent recovering from problems. The cost of availability now dwarfs the purchase and maintenance cost of technology.

So what have we as an IT industry done to address the problem?

We’ve created concepts like ITIL and CMDBs. While there are some good processes improvements here for sure, these top down modeling approaches and pre-determined rules only tell us what we already know. In my experience it is not the things we already know about that bite us in the ass and take our systems down for prolonged periods of time. It’s the multitude of unanticipated and unavoidable dependencies and interactions that take place in an complex system. And it’s impossible to know what set of dependencies and interactions will cause downtime until it occurs. Our infrastructures are just too indeterminate. That’s the point after all. Tier it, load balance it, virtualize it. So we don’t have to worry about the dependencies and interactions among all the different components. Well guess what? We do have to care. Because we have to fix it when it goes wrong.

Take the analogy of a complex air traffic control system. Sure the air traffic controllers feel really great when they arrive at work in the morning. They’ve got their coffee, flight plans and a good handle on the early morning inbound and outbound traffic.

flightplan

Then the day gets a bit more challenging. Weather conditions over Chicago backs up landings at O’Hare. A baggage handler and mechanic strike slows down JFK departures. A pilot radios he’s three degrees north over Pennsylvania but where is he really? Now you need radar. Throw the flight plans out the window. You needs to know what’s actually happening now.

radar

So how do we establish the equivalent of radar for a complex IT infrastructure. Component monitoring doesn’t work any more. If the problem is a single component failure, we already know about it. We’ve already automated the swapping in of a new machine or device. And we can reboot software components automatically. IBM’s has their own marketing play on this called “Autonomic Computing” but that too seems to only focus on the simple single component issues not the indeterminate chaos that ensues in a real running system. And it seems like more slideware than real solutions.

In my next post I’ll tackle the issue of how we might look at things differently.

Stay tuned.

Ode to Log Management

I love “log management.” I hate log management.

I love log management because years ago it was the impetus for IT to move beyond simple SNMP monitoring to collecting and trying to understand a much richer set of data about complex environments.

I hate log management for over the years it has been co-opted by vendors and analysts who’ve pigeon holed it into yet another IT management silo. These vendors and analysts have narrowly defined log management as the collection and storage of logs in some locked repository used to generate static reports to satisfy regulators, auditors and IT governance boards.

Why am I so bitter?

First it turns out logs are critical to many other stakeholders in the enterprise. Operations needs real time access to logs in order to find and fix problems and improve mean time to recovery (MTTR). Security needs logs to catch bad guys. Business people need logs to understand customer and service behavior and provide service level measurements. So locking up logs in a static repository designed for one constituency severely limits their value and diminishes the return on investment not only in a log management solution but also the return on your IT assets overall.

Secondly logs alone don’t provide anyone of the IT stakeholders with a complete picture.

Let’s take a simple example right from the hottest compliance use case today — PCI. The Payment Card Industry (PCI) Security Standards Council founded by American Express, Discover Financial Services, JCB International, Mastercard and Visa has outlined requirements for security management, policies, procedures, network architecture and software design. If you are a merchant accepting credit or debit cards and you process more than 20,000 transactions per year there are twelve specific requirements. Failure to comply with the requirements is not an option. You can be fined heavily and you can lose your ability to accept credit and debit cards.

One of the twelve requirements is the commitment to monitoring and investigating changes to configuration and password files for any application, server or device involved in the processing of card holder information and transactions. In the case of file content, permissions or attribute changes, logs will only tell me part of the story. Yes a Windows, Linux or Unix log will tell me a file has been changed but it won’t tell me who changed it. It also won’t tell me if the change was authorized or not. To understand who changed a file I need to look at the other user processes running on that server at the same time the file was changed. What user processes were running and who owned them? In Unix or Linux this information is easily viewed with a simple “ps” or “top” command but doesn’t exist in any log. In order to understand if the change was authorized or not I need to compare the log and file change information with the user information and any tickets from the service desk authorizing this user to make this type of modification.

The real reason I believe we need to move on from talking about log management is log management isn’t a market. It isn’t a solution. It is a feature in a much broader landscape of harnessing all the data being generated by our IT infrastructures.

Turning all that data info information for every stakeholder is important to the future of IT as environments grow more complex, dynamic, service oriented, virtualized and mission critical. Not just to report on compliance controls, but to improve our speed of root cause analysis, increase our ability to quickly and comprehensively investigate security attacks and develop more intimate relationships with our customers by better understand their behavior and providing a transparent view of the services they are receiving in return.

New Splunk Apps Launch at Interop and MMS

logo_interoplv2008_large.png

logo_mms_large.png
This week we were rolling in Las Vegas with Interop at one end of the strip and the Microsoft Management Summit at the other end.

At Interop we launched the Splunk for Change Management app. And at MMS the Splunk for Windows Management app made it’s debut.

Both apps make use of the Splunk Platform which provides a common set of services and APIs making it easy to create and integrate applications that leverage vast amounts of IT data. These are the second and third applications in a series of new releases we’ll be doing this year.
Splunk for PCI was the first app launched last quarter.

Splunk for Change Management App

Splunk for Change Management takes advantage of the fact that we index not just logs but configurations and file system changes as well. It also leverages a little known (but I think soon to be much more popular) Splunk search command called diff. Diff lets you easily compare two search results and returns a single result that is the different between the two. You can compare values of specific fields of results as well as every line of multi line events and files. This makes it really easy to compare configurations across lots of locations. Splunk for Change Management leverages these capabilities and brings integrated change audit, change detection and change validation.

Now your can detect unauthorized changes by indexing your trouble tickets and ticketing system logs together with your service, device and application events and configurations. We use Jira internally and find indexing our Jira tickets enables us to immediately know if a change was authorized or not. No more jumping between redundant and siloed consoles searching for the answer or writing all kinds of complicated data transformation scripts to compare the output of different management systems.

And for the first time we introduce to the industry the concept of Change Validation. Today many of us have the ability to blast out patches to hundreds of servers and device automatically. But how do we know that the changes had the desired effect? By observing the state and events generated by the actual patched systems we can now compare the before and after actual behavior. Splunk brings change audit events and configuration data together with activity and error logs so you can connect change with actual system and user behavior.

The app includes:

  • Out-of-the-box dashboards with over 40 reports showing changes across all datacenter components including applications, servers and network devices.
  • Predefined alerts that detect unauthorized change on the basis of configuration variances and correlation with service desk systems.
  • Predefined searches to help identify service-impacting changes quickly.
  • Integration with service desk systems to close the loop on change management by validating the effect of change on system behavior.

Splunk for Windows Management App

This new app integrates Microsoft’s System Center Operations Manager’s command-and-control view of a Windows infrastructure with Splunk’s IT Search. The latest version of Splunk now indexes all IT data generated by Windows servers and applications — event logs, registry keys, performance metrics and application log files. Everything is searchable from a single place to resolve service-impacting incidents faster, enhance monitoring coverage, and validate service levels.

What’s really cool is Splunk searches can be launched through Tasks in the System Center Operations Manager Console on any aspect of the infrastructure being monitored, and can be expanded to include far-flung elements of the IT infrastructure for additional context – regardless of platform or technology. Its super fast to identify information across the Windows Event Log, the Windows