Cisco CSIRT Presents at SplunkLive Raleigh

Last Thursday Dave Schwartzburg and a few other Cisco security mavens attended SplunkLive Raleigh. The Cisco Computer Security Incident Response Team (CSIRT) has been applying Splunk to corporate security investigations for more than two years now, and Dave was generous enough to share their experiences with us all. Joining Cisco as a presenter at the event was James Ervin of the University of North Carolina at Chapel Hill, a very knowledgeable Splunk customer. Patrick Ogdin, Splunk Sales Engineer, gave a rocking good demo of transaction tracing in a telco provisioning environment, and Will Hayes, Splunk Sr. Solution Architect, showed the latest Splunk for Cisco Security App, being developed together with the Cisco CSIRT team.

Cisco CSIRT Team

Dave Schwartzburg

Dave Schwartzburg is an Information Security Investigator who runs the IDS infrastructure for Cisco's corporate and internal networks and IT assets. He has an M.S. in Information Security from East Carolina University and a B.S. from the University of Wisconsin. Dave has been with the Cisco CSIRT team for two years; prior to that he was with AT&T Internet Investigations & Security Services. Cisco has more than 100,000 employees and contractors and more than 127,000 devices on its corporate network. That's a lot to keep track of, which is why the CSIRT team utilizes Splunk.

The Cisco CSIRT works to reduce the risk of loss as a result of security incidents for Cisco-owned businesses. CSIRT regularly engages in proactive threat assessment, mitigation planning, incident trending and analysis, security architecture, and incident detection and response. This happens in three phases: investigations, mitigations and prevention.

A Tier 1 Event Analysis Group is located in Costa Rica. They handle security threat monitoring. The Tier 2 Event Analysis Group in Bangalore handles the easier case investigations and mitigations. Dave is part of the Tier 3 Global Incident Response Team handling more difficult cases and longer term prevention through changes to the infrastructure and security systems.

Cisco Security Environment

Cisco regularly collects web proxy (Ironport WSA), anti-virus (Ironport ESA), host-based intrusion protection (Cisco Security Agent), syslog, VPN logs, authentication messages, network IDS signatures and Netflow records from critical subnets.

  • 3 million IDS events per day
  • 3-5 billion Netflow records per day
  • 300 malware-related cases a day

Some event sources send their data to a global network of collection servers and some event types are pulled from their sources directly to a centralized server. Splunk handles the collection and indexing of the data.

Correlation and Reporting with Splunk

The CSIRT team makes extensive use of scheduled reporting and alerting for proactive monitoring of problems.

In this example, the team correlates host-based IDS logs with antivirus logs, running malware reports via cron using the Splunk CLI. The reports run on a schedule and the results are e-mailed to the EA teams for processing and submission for remediation.
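The correlation behind such a report can be illustrated with a minimal, self-contained sketch. The log records and field names below are invented for illustration; in the real deployment this logic runs as a Splunk search invoked from cron via the CLI:

```python
# Sketch: correlate host-based IDS events with antivirus detections.
# A host that shows up in both sources becomes a malware-report entry.

def correlate(ids_events, av_events):
    """Return hosts that appear in both IDS and antivirus malware logs."""
    ids_hosts = {e["host"] for e in ids_events}
    report = []
    for e in av_events:
        if e["host"] in ids_hosts and e["action"] == "detected":
            report.append({"host": e["host"], "malware": e["malware"]})
    return report

# Invented sample events standing in for the two log sources
ids_events = [{"host": "pc-101", "sig": "suspicious-process"},
              {"host": "pc-202", "sig": "port-scan"}]
av_events = [{"host": "pc-101", "malware": "Koobface", "action": "detected"},
             {"host": "pc-303", "malware": "Conficker", "action": "cleaned"}]

print(correlate(ids_events, av_events))  # → [{'host': 'pc-101', 'malware': 'Koobface'}]
```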

“Red Carpet Reports” monitor executive systems to make sure they aren’t infected or compromised. Here we see an example of the Koobface worm found in CSA logs on an executive laptop.

Finally, the team has ways to make use of all the CSA data they receive. One of the most useful has been pinpointing users who disable Cisco Security Agent itself, indicating the machine is now unmanaged.
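The idea of flagging disabled agents can be sketched as a simple log filter. The message strings and host names here are invented, not actual CSA log formats:

```python
# Sketch: flag hosts whose agent logs show the security agent being
# stopped or disabled by a user (the machine is then unmanaged).
import re

DISABLE_PATTERN = re.compile(r"agent (stopped|disabled) by user", re.IGNORECASE)

def unmanaged_hosts(csa_logs):
    """Return hosts where a user disabled the security agent."""
    return sorted({host for host, msg in csa_logs if DISABLE_PATTERN.search(msg)})

# Invented sample log lines
logs = [("pc-7", "CSA: Agent stopped by user jdoe"),
        ("pc-9", "CSA: Policy updated"),
        ("pc-12", "CSA: agent disabled by user admin")]

print(unmanaged_hosts(logs))
```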

Results for the Security Team

The resulting productivity gain from centralized access to multiple data sources has been dramatic. Not only is the team lowering the time to respond to incidents, but lower-skilled workers can now handle more complex cases. And surprisingly, 10% of cases now come from previously unused or underutilized data sources. The value of substantially faster access to important data, and of correlation across numerous sources for reporting and ad-hoc investigations, is incredible.

Splunk for Cisco Security App

At the event, Will Hayes demonstrated the latest version of the app, which Splunk is developing together with the Cisco CSIRT team.

University of North Carolina Chapel Hill

James Ervin

James has been doing system administration, network and security monitoring, and application development at UNC since 1998, when he completed his M.S. in Computer Science at NC State University. As part of the Information Technology Services (ITS) team at UNC, his projects have included work on the university's original Active Directory deployment, Unix-based webmail systems, and security and information event monitoring. Earlier this year he inherited a centralized logging project for the university. UNC was the nation's first state university and has served North Carolina for more than two centuries; today it has 29,000 students and more than 4,000 faculty members. ITS is the largest IT organization on campus (~500 employees), looking after financials, admissions, centralized learning and centralized email. ITS frequently collaborates with the many other campus IT organizations.

ITS Environment

The ITS team manages a moderately sized mixed application, server and networking environment consisting of the following major components.

  • Multiple Unix flavors (AIX, RHEL, Solaris)
  • Large Windows infrastructure
  • ~600 devices total
  • ~20 IPS/IDS/FW/LB devices
  • PDU, environment probe data
  • Apache, Tomcat, JBoss

This environment is constantly in flux as students and faculty come and go and non-managed desktops, laptops and mobile devices connect to the network.

“We needed to determine what is possible within our environment and adopt a flexible architecture.”
- James Ervin

Earlier this year, James and his team were facing an ever-growing list of requirements for their centralized log management project, including:

  • Make syslog services more useful to the rest of the IT organizations
  • Collect and centralize Windows event logs
  • Alert on events of interest
  • Correlate security events
  • Provide NOC/SOC staff access to security logs
  • Give application developers access to application logs
  • Report on unplanned system changes
  • Satisfy the auditors

Evaluation Process

The ITS team reviewed a number of log and event centralization technologies, including the possibility of building their own, before deciding on Splunk. Database-backed products were dismissed because they require tight control over log sources in order to process incoming data properly (format changes could cause incoming data to drop). Few solutions could pull any intelligence out of arbitrary, unstructured data, and customization was often difficult or required professional services. Some products imposed severe limitations on clients and users, and ITS wanted to grant access widely to enable other IT departments to do their work. Finally, log appliances offered less customization than desired; James wanted an "open" architecture capable of handling arbitrary inputs and outputs with reasonable effort.

The Splunk Deployment

UNC's Splunk deployment includes a single Splunk indexing server fed by many different sources, with new sources arriving almost daily as new applications and servers are installed around the university. An existing centralized syslog server feeds Splunk. Approximately 80 Splunk forwarders on high-interest servers (AD domain controllers, Apache, etc.) feed Splunk as well, and a "dropbox" indexes one-time batch uploads. The primary index size is ~1TB, with data kept online for 90-day retention. The university SAN provides the back-end storage, and more than 80 users share saved searches, reports and dashboards. Users follow a long-tail distribution: a few "power users" and lots of "casual users".
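The forwarder-to-indexer setup described above can be sketched with a minimal, hypothetical Splunk forwarder configuration. The hostname, port and monitored path below are invented examples, not UNC's actual settings:

```ini
# outputs.conf on a forwarder: send events to the central indexer
[tcpout]
defaultGroup = central

[tcpout:central]
server = splunk-indexer.example.edu:9997

# inputs.conf on the same forwarder: monitor a high-interest log file
[monitor:///var/log/httpd/access_log]
sourcetype = access_combined
```

With this pattern, adding a new source is a matter of dropping another `[monitor://...]` stanza onto the relevant host, which fits the "new sources arrive almost daily" reality.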

Measuring Success

Did it work? What I really liked is the simple but powerful way James and his team measured their success with Splunk. The team asked themselves a few fundamental questions that demonstrate the project was about much more than just generating compliance reports.

  • What have we done with it that we expected to do?
  • What have we done with it that we didn’t expect to do?
  • How successful have we been?
  • What lessons have we learned?

The UNC team particularly likes the fact that Splunk has no per-client or per-user license cost, so work can be distributed more effectively and data made accessible to those who need it. James also likes Splunk because it can ingest any data you throw at it, and search-time extraction is far easier to manage than index-time extraction.
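The appeal of search-time extraction can be illustrated with a tiny sketch: fields are pulled out of raw events with a pattern at query time, so nothing needs to be re-indexed when a log format changes. The event text and field names below are invented:

```python
# Sketch: named-group regex applied to a raw event at search time.
import re

def extract(event, pattern):
    """Extract fields from a raw event string; empty dict if no match."""
    m = re.search(pattern, event)
    return m.groupdict() if m else {}

raw = "Oct 12 14:03:11 webauth login failed user=jervin src=152.2.0.14"
fields = extract(raw, r"user=(?P<user>\S+) src=(?P<src>\S+)")
print(fields)  # → {'user': 'jervin', 'src': '152.2.0.14'}
```

If the log format changes, only the pattern changes; the already-indexed raw data stays useful.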

Issue Identification and Troubleshooting

The first thing they looked at was how Splunk helps with issue identification. IT search, it turns out, is like web search: a metaphor that empowers end users. Intimate knowledge of the systems or data is not required to get results.

‘Splunk often produces serendipitous results: the “look what I found!” moments.’

In many of the UNC scenarios, Splunk provides the "what is actually happening" view that so many IT organizations, like the university ITS team, otherwise lack.

Client Remediation and Security Analysis

One of the security problems at a big university is client computing devices. Identifying the owners of laptops, desktops and PDAs that are infected, in violation of acceptable use policy, stolen, or causing network trouble requires a data-gathering process and specialized knowledge. Using Splunk to tie search results (DHCP logs, antivirus logs, etc.) to the client registration database allows results to be "doped" with additional data from the live registration database.

“In the case of a stolen laptop, input an IP/MAC address and Splunk returns the owner’s name and last known location used on the network.”
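The enrichment step amounts to a lookup-table join between search results and the registration database. The registration records and field names below are invented for illustration:

```python
# Sketch: "dope" DHCP search results with owner data from the client
# registration database (here a hard-coded dict standing in for the DB).

registration = {
    "00:1a:2b:3c:4d:5e": {"owner": "A. Student", "building": "Davis Library"},
}

def enrich(dhcp_events):
    """Attach owner/location from the registration DB to each lease event."""
    out = []
    for e in dhcp_events:
        info = registration.get(e["mac"], {})
        out.append({**e, **info})
    return out

leases = [{"mac": "00:1a:2b:3c:4d:5e", "ip": "152.2.1.20"}]
print(enrich(leases))
```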

Another key security driver at UNC is security event correlation, including correlation of IDS/IPS events with server and network events for short-term alerting and long-term reporting. Splunk correlates IDS/IPS data (Snort, etc.) from multiple sensors and issues alerts based on thresholds and combinations of events representing specific situations:

  • more than 10,000 hits from a single source over a time period
  • more than 15,000 hits from multiple sources over a time period (DDoS detection)
  • hits for high-risk signatures
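The threshold logic above can be sketched in a few lines, using invented event records; the real deployment expresses these as scheduled Splunk searches:

```python
# Sketch: single-source flood and aggregate (DDoS-like) thresholds.
from collections import Counter

def alerts(events, single_src=10000, multi_src=15000):
    """Flag sources exceeding per-source and aggregate hit thresholds."""
    per_src = Counter(e["src"] for e in events)
    fired = [f"flood from {s}" for s, n in per_src.items() if n > single_src]
    if len(events) > multi_src and len(per_src) > 1:
        fired.append("possible DDoS: aggregate threshold exceeded")
    return fired

# Invented sample: 10,001 hits from one source trips the per-source alert
events = [{"src": "10.0.0.1"}] * 10001
print(alerts(events))  # → ['flood from 10.0.0.1']
```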

The combination of pre-defined search alerts and the ability to do real-time arbitrary correlation (e.g. free-text search lets us correlate any attacker IP with events across ALL log sources via a single search) is really powerful.

James has found that Splunk goes beyond typical security event correlation in other ways too. Being able to audit all kinds of system and user activity provides the type of bird's-eye view the team never had before. Examples include:

  • Report of administrator account usage in entire AD forest used by AD administrators to discourage use of admin accounts on untrusted machines that might be keylogged
  • Geolocation of IDS/IPS events via SDK and MaxMind GeoIP database allows security team to “eyeball” results, eliminating tedious investigative steps
  • Web-based password change utility was being brute-forced; Splunk now reports when the number of requests to this page exceeds a threshold
  • Classroom Support uses a Splunk-generated report to track student lab usage
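The geolocation example above can be reduced to a lookup that maps attacker IPs to countries. The table below is a hard-coded stand-in for the MaxMind GeoIP database; prefixes and countries are invented:

```python
# Sketch: map IDS/IPS event source IPs to countries for quick "eyeballing".
# In the real deployment this uses the MaxMind GeoIP database via the SDK.

GEO = {"152.2.": "US", "81.19.": "RU"}  # invented prefix-to-country table

def country(ip):
    """Return the country for an IP prefix, or 'unknown'."""
    for prefix, cc in GEO.items():
        if ip.startswith(prefix):
            return cc
    return "unknown"

print([country(ip) for ip in ["152.2.1.20", "81.19.5.5", "8.8.8.8"]])
```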

Lessons Learned

Perhaps the biggest lesson UNC has learned to date is how unanticipated uses are often as important as the anticipated ones. “Teach a man to fish…” the saying goes.

‘How successful is Splunk? One of our users was quoted saying, “Thank god for this.”’

Simplicity is a virtue, and so, sometimes, is complexity. Splunk provides both a simple interface and a more powerful, customizable interface if you want to dig further. But the real power is in giving people tools that help them think, not tools that turn off their brains while they stare at red, yellow or green. Of course, the UNC team also commented that they've learned products are not substitutes for policy, but policy is no substitute for reality. And there is no shortage of unenforceable policies at the university.

‘The Splunk flexible architecture helps us to achieve the “middle ground” between what we need and what is achievable. New problems always emerge as old ones are solved. A good architecture enables you to solve the new problems, rather than forcing the new problems to fit into the old box.’

Unanticipated Benefits

So what else can a flexible, easy-to-implement architecture do for a centralized logging infrastructure? Well, no more local logging, for one. Some servers simply can't log locally due to volume, performance, etc. This is bad from an auditing standpoint when policy requires retaining all logs for the time mandated by legal and industry regulations. Splunk uses a local forwarder to route data over the network without logging it locally, and even if the network goes down Splunk won't lose events. The result is an ability to run transactional searches on high-volume log sources without impacting the original service or developing specialized SQL or reporting applications.
