Raffy: Archive for March, 2008

All the Data That’s Fit to Visualize - SOURCE Boston 2008

I gave a talk at SOURCE Boston 2008. The topic this time was visualization in general and what has gone wrong in security visualization in the past. I showed how we can learn and steal from other disciplines, in this case from the New York Times. The NYT has done some pretty fantastic work in the area of data visualization. Their interactive market map, for example, is a great way of exploring stock data. During the talk, I outlined some of the design principles that the NYT graphics department uses when designing its graphs, among them: Show - Don’t Tell.


To start my presentation, I showed a little video about security visualization (see below).

At conferences lately, I find that I am no longer the only one talking about security visualization. More and more presentations show visualizations, and a lot of projects use visualization to help analyze all the data at hand. At SOURCE, Dave Dittrich from the University of Washington talked about BotNet analysis and visualizing network traffic captured from BotNets. He definitely faces the challenge of displaying large amounts of data. We discussed some approaches, and parallel coordinates might work for his data. Parallel coordinates are what I used in my book for some BotNet traffic analysis.

Common Event Syntax

As part of the Common Event Expression (CEE) effort, a list of field names has been published.

If log records from different log sources have to be correlated, or reports have to be generated across different log sources, a common set of field names is needed. Take a firewall log example. Assume that you have two types of firewalls in your environment: Netscreen and PIX. The two devices write different types of log entries. Assume you have a parser for each that extracts fields from the logs. Each parser might name the fields differently, making it really hard, if not impossible, to correlate the two log files. Just think about reporting. How do you find the top source addresses across both logs? These are logs from each of the firewalls:

Netscreen:

May  5 17:01:40 45.2.0.1 NOC-FWa: NetScreen device_id=NOC-FWa [Root]
system-notification-00257(traffic): start_time="2006-05-05 17:01:40"
duration=0 policy_id=52 service=tcp/port:26212 proto=6 src zone=backbone
dst zone=noc-mgt action=Deny sent=0 rcvd=0 src=222.81.119.59 dst=45.2.121.102
src_port=7000 dst_port=26212

Pix:

Jan 18 12:43:50 192.168.1.1 %PIX-6-106015: Deny TCP (no connection)
from 208.58.193.69/1062 to a.b.c.d/443 flags ACK

If you report on “src”, you will miss the PIX entries, where the source address follows the word “from”. We need unified names.
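
To make the problem concrete, here is a minimal Python sketch of what such a normalization could look like. The unified names (src_ip, dst_ip and so on) are made up for illustration and are not taken from the CEE field list. The two regular expressions are assumptions as well: they cover only the two sample lines above (joined onto one line each, with whitespace normalized) and are nowhere near a complete parser for either device.

```python
import re
from collections import Counter

# Sample lines from the two firewalls, each joined into a single string.
NETSCREEN_LINE = (
    'May  5 17:01:40 45.2.0.1 NOC-FWa: NetScreen device_id=NOC-FWa [Root] '
    'system-notification-00257(traffic): start_time="2006-05-05 17:01:40" '
    'duration=0 policy_id=52 service=tcp/port:26212 proto=6 src zone=backbone '
    'dst zone=noc-mgt action=Deny sent=0 rcvd=0 src=222.81.119.59 dst=45.2.121.102 '
    'src_port=7000 dst_port=26212'
)
PIX_LINE = (
    'Jan 18 12:43:50 192.168.1.1 %PIX-6-106015: Deny TCP (no connection) '
    'from 208.58.193.69/1062 to a.b.c.d/443 flags ACK'
)

# Hypothetical mapping: native Netscreen key -> unified field name.
NETSCREEN_MAP = {
    "src": "src_ip",
    "dst": "dst_ip",
    "src_port": "src_port",
    "dst_port": "dst_port",
    "action": "action",
}

# key=value pairs; the value is either quoted or runs to the next whitespace.
NETSCREEN_KV = re.compile(r'(\w+)=("[^"]*"|\S+)')

# Matches only the "Deny ... from ip/port to ip/port" shape shown above.
PIX_DENY = re.compile(
    r'%PIX-\d-\d+: (?P<action>\w+) \w+ .*?'
    r'from (?P<src_ip>\S+)/(?P<src_port>\d+) '
    r'to (?P<dst_ip>\S+)/(?P<dst_port>\d+)'
)


def parse_netscreen(line):
    """Extract key=value pairs and rename the ones we have a mapping for."""
    raw = {key: value.strip('"') for key, value in NETSCREEN_KV.findall(line)}
    return {unified: raw[native]
            for native, unified in NETSCREEN_MAP.items() if native in raw}


def parse_pix(line):
    """Pull the same unified fields out of the free-text PIX message."""
    match = PIX_DENY.search(line)
    return match.groupdict() if match else {}


events = [parse_netscreen(NETSCREEN_LINE), parse_pix(PIX_LINE)]

# With unified names, one query covers both devices.
top_sources = Counter(e["src_ip"] for e in events if "src_ip" in e)
print(top_sources.most_common())
```

Once both parsers emit the same names, the "top source addresses" report is a single counter over src_ip, no matter which firewall wrote the entry.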

It is important to have not just a common set of names, but also a common understanding of what the individual fields mean. What are the semantics of a field? For example, how do you measure a duration? In seconds? Hours? Days? What is a destination host? Is it fully qualified or just the host name itself? The field list, which can be found in this post: CEE Fields List, is a first step towards standardizing this.

Note that ArcSight’s CEF, for example, publishes a dictionary along with its log syntax. The CEE field list can be used to standardize the names across various log formats and can hopefully replace and extend ArcSight’s dictionary.