IT Search vs. SIEM - Data Collection
I have had a lot of conversations lately about IT search versus SIEM (security information and event management), the more traditional way of doing security event management. People are asking me how Splunk’s technology differs from all the log management tools. With ArcSight (my former employer) going public, LogLogic going through some turmoil in their executive management, and Splunk having just closed an amazing round of investment, people are very interested in understanding what the deal is.
The topics of SIEM and IT search are fairly similar. However, there are some very important differences that I want to start pointing out in a series of blog posts.
Let me start with the topic of data collection. In a SIEM system, you use a collector, a connector, or an agent (I don’t really care what you call it, but it’s some piece of code which reads the data and feeds it into the system) to process the data before you can use it in your SIEM for correlation, reporting, or forensic purposes. If you do not have a connector specifically written for your data source, you are out of luck. Just to be clear, I am not talking about having a connector for files, for ODBC, for SNMP traps, or for syslog over UDP/514. I am talking about a connector for each specific data source: Snort syslog, Snort database, CheckPoint OPSEC, CheckPoint syslog (do they have a syslog output?), PIX over syslog, Cisco router over syslog, etc.
What this means is that either the SIEM already supports your data source, the vendor builds you a connector, or you build it yourself. Most of the SIEM tools have some sort of an SDK that you can use for this purpose. However, do you have the manpower and the skills in-house to do so? If not, does the SIEM company have the bandwidth to build your connector in an acceptable time?
What happens if the source data format changes? For example, Snort might slightly change its syslog format. Guess what has to happen. Yes! The connector needs to be updated to support the new format. If you don’t plan accordingly and get an updated connector right away, this could mean a few days of downtime for that data source.
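To make that brittleness concrete, here is a minimal sketch of a source-specific parser. The log layout and the sample lines are made up for illustration (this is not Snort's real syslog format), and `parse_alert` is a hypothetical name, not part of any SIEM SDK.

```python
import re

# Hypothetical alert layout, NOT the real Snort syslog format:
#   "<sig_id> <message> <src_ip> -> <dst_ip>"
ALERT_V1 = re.compile(
    r"^(?P<sig_id>\d+) (?P<msg>.+) "
    r"(?P<src>\d+\.\d+\.\d+\.\d+) -> (?P<dst>\d+\.\d+\.\d+\.\d+)$"
)

def parse_alert(line):
    """Return a dict of fields, or None if the line does not match."""
    match = ALERT_V1.match(line)
    return match.groupdict() if match else None

old_line = "1002 ICMP ping sweep 10.0.0.5 -> 10.0.0.9"
# The same event after the source starts wrapping the ID in brackets.
new_line = "[1002] ICMP ping sweep 10.0.0.5 -> 10.0.0.9"

parsed_old = parse_alert(old_line)   # fields are extracted
parsed_new = parse_alert(new_line)   # no match: the event is lost
```

A two-character change in the source is enough to make the parser reject every event until somebody ships an updated pattern.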
No connector - No data
Now, what is the deal in the IT search world? Well, you need some sort of connector as well. However, you only need one to transport the data from the data source into the search system. In other words, you need about a handful of connectors: one for ODBC, one for receiving syslog on UDP/514, one for text files, and one for databases other than ODBC. (Okay, okay, I will add one for CheckPoint’s OPSEC). That’s it. You don’t need a specific connector for each data source. You also don’t have to update anything every time the data source decides to slightly change the logging format. [And if you think that never happens, have a look at SiteProtector.]
What does this mean? It means that from the day you install your IT search technology, you are able to work with your logs. You don’t have to wait until the right connector is available.
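As a rough sketch of the idea (not how Splunk or any other product is actually implemented), the store below keeps every line as raw text and only indexes its tokens. The class name `RawEventStore`, its methods, and the sample lines are all assumptions made up for this example.

```python
import re
from collections import defaultdict

class RawEventStore:
    """Keeps events as raw text and indexes their tokens for search."""

    def __init__(self):
        self.events = []                 # list of (source, raw_line)
        self.index = defaultdict(set)    # token -> positions in self.events

    def add(self, source, line):
        position = len(self.events)
        self.events.append((source, line))
        # No source-specific parsing: split into word-like tokens only.
        for token in re.findall(r"[\w.]+", line.lower()):
            self.index[token].add(position)

    def search(self, *terms):
        """Return raw lines that contain every term (case-insensitive)."""
        if not terms:
            return []
        hits = set.intersection(
            *(self.index.get(term.lower(), set()) for term in terms)
        )
        return [self.events[i][1] for i in sorted(hits)]

store = RawEventStore()
store.add("syslog", "1002 ICMP ping sweep 10.0.0.5 -> 10.0.0.9")
store.add("syslog", "[1002] ICMP ping sweep 10.0.0.5 -> 10.0.0.9")
store.add("file", "login failed for user bob from 10.0.0.5")

by_id = store.search("1002")                  # both alert lines
by_host = store.search("10.0.0.5", "failed")  # only the login line
```

Both variants of the alert line are searchable the moment they arrive, because nothing in the collection path depends on the layout of the line.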
So much for now. In my next post I will talk about structured data.


December 3rd, 2007 at 11:18 am
Raffy,
IMHO it seems that you are comparing IT search with a specific SIEM architecture, even if it is the one used by almost all products on the market.
SIEMs can also collect the logs before performing any kind of normalization, avoiding the problem of having to alter the agent because of a log format change. However, I don’t know if there is any product that works this way.
regards,
Augusto
December 3rd, 2007 at 1:06 pm
Ye bastard
Since when does simply getting a new CEO qualify as “turmoil”?
December 3rd, 2007 at 4:28 pm
Well… Turmoil equals: VP Engineering left, VP Marketing left, VP Product Management was made CEO, and most recently, there is a new CEO. Is this not turmoil?
December 4th, 2007 at 1:11 pm
Interesting post about data collection. However, I wasn’t sure if you were supporting an agent versus an agentless architecture. To be fair, there are more reasons to deploy an agent than you mentioned. The first challenge of log management is to gather all of the logs in a single place, and then to be able to quickly review and correlate them. Although agentless architectures seem easier to manage, they can often create huge management challenges. How do I know I have all the data? Syslog over UDP is fire and forget. How can I tell an auditor that I only think I have all the logs? Another point is managing the flow of data. Without a method for managing the flow and bandwidth, I could oversaturate a network with useless ping flood logs or a DoS. Agents give me an opportunity to control and secure the flow of data. I can also filter and aggregate events before sending them. Another advantage of agents is that they can normalize and categorize data before it reaches the end repository. I don’t think I need to explain this advantage.
In summary, I think the best data collection solutions should be able to support agents and agentless deployments. The customer should define the pros and cons, not the vendor.
Michael
December 4th, 2007 at 11:40 pm
Raffy…. who is LogLogic.. Never heard of them.
-Lachlan
December 5th, 2007 at 9:04 am
Burning some bridges, are we? Just remember, it’s a small valley…
December 5th, 2007 at 11:01 am
I am not trying to burn any bridges. Not at all. It would be slightly counterproductive for me to do so.
I am simply outlining the impacts that you encounter if you need structured data. Obviously there is reason and benefit to first parsing your data to impose structure. It’s an information-theoretic problem. If you add information at the time of collection, you have more information available during analysis time. If you are not adding information at collection time, you have less information available at processing time and you might have to add that information during processing. There are pros and cons - just like the time/space trade-offs in programming.
December 5th, 2007 at 1:31 pm
Michael, I am supporting neither an agentless nor any other method. I like the fact that you bring up the whole agent discussion. It’s a very marketing-driven thing. I will repeat what I said in the post already: I don’t care how you get the data. There is always some piece of code that processes the information (event, log, etc.) and feeds it into the system. Agent or not, I don’t care.
The question you need to start asking is whether that piece of code HAS to live on the machine that generates the data. Most SIMs I know have either capability. Use whatever the data source and your security policy demand.
Again, the post has nothing to do with agents! It’s about data! Most of your comments are about agents and transport. However, they are valid and good points.
The comment about categorization - let me extend this to generic event enrichment - is definitely true and goes back to my other comment about information in events. The more processing you do on the collector side, the less you need to do centrally…
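A small sketch of what collector-side processing can look like, assuming a collector that filters out noise and collapses duplicates before forwarding. The function name `aggregate`, its parameter, and the sample lines are invented for this example and do not describe any particular product.

```python
from collections import Counter

def aggregate(lines, drop_if_contains=()):
    """Filter out unwanted lines, then collapse duplicates into counts."""
    kept = [
        line for line in lines
        if not any(marker in line for marker in drop_if_contains)
    ]
    # Counter keeps first-seen order, so output order follows the input.
    return list(Counter(kept).items())

batch = [
    "ICMP echo request from 10.0.0.5",
    "ICMP echo request from 10.0.0.5",
    "debug: heartbeat",
    "login failed for user bob",
    "ICMP echo request from 10.0.0.5",
]

# Five raw lines in, two (line, count) pairs forwarded to the center.
forwarded = aggregate(batch, drop_if_contains=("debug:",))
```

The flip side is the trade-off from my earlier comment: whatever the collector drops or collapses is no longer available centrally.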
December 6th, 2007 at 1:44 pm
Well, “parsed / structured LATER [i.e. after rules are written]” vs “unparsed / text NOW” is a major debate, that is for sure. I have a sneaking suspicion that “BOTH” will be the correct answer for the foreseeable future.
December 7th, 2007 at 12:09 pm
If I could ask a rather basic question, is it not true that SIEM is proactive, whereas IT Search is done on occasion or after the fact? I have not used Splunk, but am an old security hand now working in data virtualization. I can certainly see the advantage of simplicity in IT Search (nice positioning, BTW), but I just wonder if it addresses the fear of attack customers might have.
Thanks in advance for the education.
Tim
December 7th, 2007 at 12:30 pm
Tim, you are getting into a discussion of real-time correlation. I will pick that up in a future post. I would not call it proactive. It is still sort of after the fact. Only after you have seen some artifact will you be able to do something about it. No data, no action. I will make a note to make sure I cover this in a future post. Thanks for the comment.
December 7th, 2007 at 1:31 pm
Raffy - thanks for the answer. I’ll keep an eye out for your post on correlation.
December 7th, 2007 at 2:34 pm
Did someone call someone a bastard? Don’t you guys have anything better to do?
December 12th, 2007 at 2:12 pm
I would say there is no “agentless”. Even a syslogd is some sort of agent.
The question is instead, how many agents are needed. I guess from a security point of view, it is better to have fewer agents, and those communicating over a secure channel.
January 3rd, 2008 at 10:39 am
[...] across an interesting blog entry by Raffy at Splunk. As a marketing guy I am jealous as they are generating a lot of buzz about [...]
April 6th, 2008 at 11:27 am
[...] already seeing a great deal of value from operations-side search, and the extension of that value due to platform approaches, APIs and open collaboration. Splunk is [...]