thebaumblog: Splunk Live

Cisco CSIRT Presents at SplunkLive Raleigh

Last Thursday Dave Schwartzburg and a few other Cisco security mavens attended SplunkLive Raleigh. The Cisco Computer Security Incident Response Team (CSIRT) has been applying Splunk to corporate security investigations for more than two years now, and Dave was generous enough to share their experiences with us all. Joining Cisco in presenting at the event was James Ervin of the University of North Carolina at Chapel Hill, a very knowledgeable Splunk customer. Patrick Ogden, Splunk Sales Engineer, gave a rocking good demo of transaction tracing in a telco provisioning environment, and Will Hayes, Splunk Sr. Solution Architect, showed the latest Splunk for Cisco Security App being developed together with the Cisco CSIRT team.

Cisco CSIRT Team

Dave Schwartzburg

Dave Schwartzburg is an Information Security Investigator and runs the IDS infrastructure for Cisco Corporate and its internal networks and IT assets. He has an M.S. in Information Security from East Carolina University and a B.S. from the University of Wisconsin. Dave’s been with the Cisco CSIRT team for two years and prior to that was with AT&T Internet Investigations & Security Services. Cisco has more than 100,000 employees and contractors and more than 127,000 devices on its corporate network. That’s a lot to keep track of, which is why the CSIRT team utilizes Splunk.

The Cisco CSIRT works to reduce the risk of loss as a result of security incidents for Cisco-owned businesses. CSIRT regularly engages in proactive threat assessment, mitigation planning, incident trending with analysis, security architecture, and incident detection and response. This happens in three phases: investigation, mitigation and prevention.

A Tier 1 Event Analysis Group is located in Costa Rica. They handle security threat monitoring. The Tier 2 Event Analysis Group in Bangalore handles the easier case investigations and mitigations. Dave is part of the Tier 3 Global Incident Response Team handling more difficult cases and longer term prevention through changes to the infrastructure and security systems.

Cisco Security Environment

Cisco regularly collects web proxy (Ironport WSA), anti-virus (Ironport ESA), host-based intrusion protection (Cisco Security Agent), syslog, VPN logs, authentication messages, network IDS signatures and Netflow records from critical subnets.

  • 3 million IDS events per day
  • 3-5 billion Netflow records per day
  • 300 malware-related cases a day

Some event sources send their data to a global network of collection servers and some event types are pulled from their sources directly to a centralized server. Splunk handles the collection and indexing of the data.

Correlation and Reporting with Splunk

The CSIRT team makes extensive use of scheduled reporting and alerting for proactive monitoring of problems.

In this example, the team is correlating host-based IDS with antivirus logs and running malware reports via cron, using the Splunk CLI. The reports run on a schedule and the results are emailed to the EA teams for processing and submission for remediation.
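To make the idea of correlating host-based IDS events with antivirus logs concrete, here is a minimal Python sketch. It is not the CSIRT team's actual report or a Splunk search; the field names (host, signature, threat) and the sample events are assumptions made up for illustration. It simply groups both feeds by host and keeps the hosts that show up in both.

```python
from collections import defaultdict

def correlate_by_host(ids_events, av_events):
    """Return hosts seen in BOTH feeds, with their events from each feed."""
    by_host = defaultdict(lambda: {"ids": [], "av": []})
    for event in ids_events:
        by_host[event["host"]]["ids"].append(event)
    for event in av_events:
        by_host[event["host"]]["av"].append(event)
    # Keep only hosts that have at least one event in each feed.
    return {h: v for h, v in by_host.items() if v["ids"] and v["av"]}

# Hypothetical sample events; real log formats will differ.
ids_events = [
    {"host": "laptop-01", "signature": "suspicious process launch"},
    {"host": "laptop-02", "signature": "registry modification"},
]
av_events = [
    {"host": "laptop-01", "threat": "example-worm"},
    {"host": "laptop-03", "threat": "example-trojan"},
]

matches = correlate_by_host(ids_events, av_events)
print(sorted(matches))
```

A host flagged by both sources is usually a stronger candidate for remediation than one flagged by either source alone, which is the point of running the two feeds through one report.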

“Red Carpet Reports” monitor executive systems to make sure they aren’t infected or compromised. Here we see an example of the Koobface worm found in CSA logs on an executive laptop.

Finally, the team has a way to make use of all the CSA data it receives. One of the most useful applications has been pinpointing people who disable Cisco Security Agent itself, which indicates the machine is now unmanaged.

Results for the Security Team

The resulting productivity from centralized access to multiple data sources has been dramatic. Not only is the team lowering the time to respond to incidents, but it is also allowing lower-skilled workers to handle more complex cases. And, surprisingly, 10% of cases now come from previously unused or underutilized sources. The value of substantially faster access to important data and correlation across numerous sources for reporting and ad-hoc investigations is incredible.

Splunk for Cisco Security App


University of North Carolina Chapel Hill

James Ervin

James has been doing system administration, network and security monitoring and application development with UNC since 1998, when he completed his MS in Computer Science at NC State University. As part of the Information Technology Services (ITS) team at UNC, his projects have included work on the university’s original Active Directory deployment, Unix-based webmail systems, and security information and event monitoring. Earlier this year he inherited a centralized logging project for the university. UNC was the nation’s first state university, serving North Carolina for more than two centuries, with 29,000 students and 4,000+ faculty members. ITS is the largest IT organization on campus (~500 employees), looking after financials, admissions, centralized learning and centralized email. ITS frequently collaborates with the many other IT organizations on campus.

ITS Environment

The ITS team manages a moderate-size, mixed application, server and networking environment consisting of the following major components:

  • Multiple Unix flavors (AIX, RHEL, Solaris)
  • Large Windows infrastructure
  • ~600 devices total
  • ~20 IPS/IDS/FW/LB devices
  • PDU, environment probe data
  • Apache, Tomcat, JBoss

This environment is constantly in flux as students and faculty come and go and non-managed desktops, laptops and mobile devices connect to the network.

“We needed to determine what is possible within our environment and adopt a flexible architecture.”
- James Ervin

Earlier this year, James and his team were facing an ever-growing list of requirements for their centralized log management project, including:

  • Make syslog services more useful to the rest of the IT organizations
  • Collect and centralize Windows event logs
  • Alert on events of interest
  • Correlate security events
  • Provide NOC/SOC staff access to security logs
  • Give application developers access to application logs
  • Report on unplanned system changes
  • Satisfy the auditors

SplunkLive Seattle Kicks IT

On what was an incredibly beautiful day we had more than 100 Splunk devotees attend our first ever SplunkLive event in Seattle last week. In the shadow of Microsoft we talked about our Windows and Microsoft strategy and compared notes with lots of customers that are running mixed Microsoft, Linux and Solaris environments. Many of our customers with Microsoft Active Directory, Exchange and SharePoint environments are utilizing Splunk to troubleshoot problems and implement security and compliance controls in large-scale, distributed environments. But I’m still surprised at how little Microsoft .NET we’re seeing in large-scale production applications.

Three Seattle-based customers presented their views on managing mission critical applications, IT data consolidation and Splunk.

  • T-Mobile USA
  • Blue Nile
  • Washington State University

T-Mobile USA

Sean White, Senior Engineer with T-Mobile Operations in Bellevue, talked with us about their global rollout of Splunk. Sean is a member of the security engineering team charged with incident response, IDS, vulnerability scanning, anti-virus and enterprise unified logging. He graduated with a B.S. in Computer Science from the University of Kansas and has a deep background in large telecom environments, initially as a system administrator and webmaster, then in SS7 network C&C and performance, engineering, and now in information security. Sean has been at T-Mobile for 4 years and before that was at Cingular, AT&T Wireless. T-Mobile USA is the 4th largest US national provider of wireless voice, messaging, and data services to 34M subscribers with annual revenues of $17B. T-Mobile USA is the US operating entity of T-Mobile International AG, the mobile communications subsidiary of Deutsche Telekom AG (NYSE: DT). Deutsche Telekom is one of the largest telecommunications companies in the world, with nearly 120 million customers worldwide.

It all started with PCI Compliance

Like many of our enterprise customers, T-Mobile started working with Splunk in one area but quickly saw the value of expanding into others. For Sean and his team, PCI Compliance was the beginning of the Splunk solution footprint, but soon everyone realized the consolidation of logs, events, messages, configurations and changes meant a whole lot more.

T-Mobile began by proving PCI compliance, which comes with very specific requirements. PCI DSS Section 10 reads: track and monitor all access to network resources handling cardholder data. In T-Mobile’s case, scale was a big issue. Fulfilling PCI DSS Section 10 meant tracking 26+ in-scope applications and being able to trace transactions from start to finish across 650+ servers running Windows, Linux and Unix varieties. It also means more than 100 individuals logging into Splunk on a daily basis as part of the process.

The Splunk Set-up

The Splunk configuration consists of

  • Pairs of forwarders set up in each of 4 geographic locations.
  • Three short term indexers + 1 short term search box.
  • Three Long-term search boxes hooked into a 32 TB NAS.
  • Centrally controlled from a single deployment server.

The current installation is indexing more than 600GB/day of data and has just passed the 10B event mark. Controlling access to all this data is critical and T-Mobile has Splunk roles set up for managers and application teams to limit access to subsets of the data. The ability to segregate data access along lines of duties is critical to prove PCI compliance.

The Business Case for a SOC

In addition to proving PCI Compliance, T-Mobile has discovered Splunk’s use for Security as well. Not long ago, a SIEM vendor would have told you IDS and firewall logs were all you need. That >=2 sources of data == correlation. Not so much.

“All the best new vulnerabilities are coming in on the application layer.”
- Sean White

Enterprise logging, meaning visibility into all of your IT data, is absolutely critical in defending against modern blended attacks. At T-Mobile, Splunk has become a primary analysis tool for deciphering what is happening to the applications, servers and devices on the network. With a few saved searches, Splunk does real correlation.

Nothing Boring about Logs and IT Data!

PCI Compliance mandates gave T-Mobile the excuse (read: funding) to start an enterprise logging initiative. Logging all security, network and application events can truly give the insight needed not only to measure and report on compliance controls but also to run a more secure and effective business. T-Mobile has also discovered that the ability to ask any question of its environment and get immediate answers provides a pile of value to help desk operations and business intelligence functions.

“All the information about your company is in your logs—there’s nothing boring about it.”


Blue Nile

Jerry Brennock, Director, Core Development at Blue Nile, explained how the company is using Splunk to improve the experience of buying diamonds over the Web. Blue Nile, Inc. is an online retailer of diamonds and fine jewelry offering in-depth educational materials and unique online tools that place consumers in control of the jewelry shopping process. Importantly, the focus is on giving customers a great experience at a great price, which translates to requiring high quality at a low cost. Jerry’s team builds and supports the infrastructure and applications for merchandising and marketing, including the website. He’s been with Blue Nile for 10 years and in the e-commerce space for more than 17.

The Killer Diamond App

Diamond Search is undoubtedly the killer application for Blue Nile’s e-commerce experience. It’s an asynchronous JavaScript app that has to work across any browser, and there are many non-obvious use cases. Together, these three factors mean it is prone to failure in lots of edge cases.

“If this application isn’t fast and accurate, we don’t sell diamonds.”
- Jerry Brennock

Jerry’s team has embedded tracking pixels with name-value pairs to track JavaScript profile information from each diamond search. This, together with Web server 500 and 404 errors, gives the development, operations and customer support teams all the data they need to troubleshoot problems. The challenge is finding customer problems “in the moment” before the sale is lost.
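For readers who have not worked with tracking pixels, the sketch below shows roughly how name-value pairs carried on a pixel request can be pulled back out of a web server log entry. The request path, the parameter names (event, load_ms, browser) and the 500 ms threshold are all hypothetical; Blue Nile's actual pixel format was not shown.

```python
from urllib.parse import urlsplit, parse_qs

def parse_pixel_request(request_path):
    """Extract the name-value pairs from a tracking pixel request path."""
    query = urlsplit(request_path).query
    # parse_qs returns a list per key; keep the first value of each.
    return {name: values[0] for name, values in parse_qs(query).items()}

# Hypothetical pixel request as it might appear in an access log.
fields = parse_pixel_request(
    "/pixel.gif?event=diamond_search&load_ms=840&browser=firefox"
)

# Flag searches slower than an assumed 500 ms budget.
slow = int(fields["load_ms"]) > 500
print(fields, slow)
```

Once the pairs are available as fields, slow or failing searches can be lined up against the 500 and 404 errors from the same time window.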

Splunk Live Taipei Breaks All Records

More than 300 people attended Splunk Live Taipei last week and our partners at Systex hosted an incredible show of Splunk use cases, customer speakers and hands-on labs. The Systex Splunk Lab provided attendees with the opportunity to use Splunk with CICS and IBM System z mainframe data, Windows, servers and desktops, Unix and Linux, customer service operations environments, telco provisioning environments and more.

I’ll be posting separately on the hands-on Systex Splunk Lab.



Our first guest customer speaker was Yi-Lang Tsai (蔡一郎), the Taiwan Chapter Chief Security Officer of the Global Honeynet Project and the Division Manager of the National Center for High-performance Computing, a Honeynet Project sponsor. Yi-Lang is also a freelance writer with more than 30 books published on operating systems, network and system security and IT management. He presented the very important botnet work the Honeynet Project is doing and showed how his team is using Splunk to deepen their research and expose what they find to the Honeynet audience of security professionals worldwide.

What is Honeynet?

The mission of the Honeynet Project is to learn the tools, tactics, and motives of the blackhat community, and share the lessons learned. Honeynet is an all-volunteer organization of security professionals around the world dedicated to researching cyber threats by deploying networks to be hacked. The goals are:

  • Awareness: to raise awareness of the threats that exist,
  • Information: for those already aware, to teach and inform about the threats and
  • Research: to give organizations the capabilities to learn more on their own.

Honeynet is completely open source and all of the work, research and findings are shared. Everything captured is happening in the wild (there is no theory). The organization has no agenda, no employees and no product or service to sell.

A Honeynet is simply a “high-interaction” honeypot attracting any and all cyber threats and attacks. It is an architecture, not a product or software, that gets populated with live systems donated and run by the various Honeynet chapters globally.

Once the Honeynet is compromised, data is collected, correlated and analyzed to learn the tools, tactics, and motives of the blackhat community. Specific benefits to the global community of security professionals are the

  • Research: Identifying new tools and new tactics,
  • Profiling: Generating and maintaining lists of blackhats,
  • Protection: Early detection, warning and prediction,
  • Response: Forensics and incident response and
  • Self-defense.

    Taiwan Honeynet Chapter’s Environment

    Yi-Lang’s environment at the Taiwan National Center for High Performance Computing distributes Honeynet/Honeypots to the Taiwan Education Network, Taiwan Chapter members and the GDH project. The environment makes heavy use of virtualization in its deployment; you might call it a “Virtual Machine Honeynet.” It’s running on an advanced blade server with 128GB of memory running VMware ESX. The blade server uses either SAS or SSD storage. More than 200 Windows 2K/2K3, Windows XP/Vista/7, Linux and FreeBSD servers run in high and low interaction honeypots.

    The Taiwan Honeynet deployment is distributed across four data centers in different geographies: Taipei, Hsinchu, Taichung and Tainan. This distributed topology gives the honeypot a broad-reaching capture network and makes use of idle network and CPU. This large-scale Honeynet deployment supports:

    • Malware Collection and Analysis
    • Honey-Driven Botnet Detection
    • Client-Side Attack
    • Malicious Web Server Exploring
    • RFI Scripts Detection
    • Fast-Flux Domain Service Tracking
    • Research Alliance
    • Distributed Search and Analysis on Honeynet Data

    Why Splunk?

    The Taiwan Honeynet team uses Splunk to collect and manage information from the distributed Honeynet infrastructure, including GBs of logs, 400k+ connections, 2GB+ of traffic flows, and tool events and metrics.


    http://blogs.splunk.com/thebaum/wp-content/uploads/2009/10/allindexdata.png

    Data analysis is performed against a variety of pivot points that are automatically extracted from the Honeynet data sources. Date and time, malware source IP address, destination IP, protocol, file name and malware MD5 are some of the main fields Splunk identifies and provides to the team for deeper analysis. In addition to Splunk searches and reports, the team has built custom geo-dashboards with high-resolution displays by tapping into the Splunk API.
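As a rough illustration of what pivoting on extracted fields means, the Python sketch below counts events by any one field, for example by malware MD5 or by source IP. The field names, addresses and hash placeholders are assumptions for the example only; they are not the Taiwan chapter's data or Splunk's own field names.

```python
from collections import Counter

# Hypothetical events with already-extracted fields.
# IPs come from documentation ranges; md5 values are placeholders, not real hashes.
events = [
    {"src_ip": "192.0.2.10", "dest_ip": "198.51.100.5", "protocol": "tcp",
     "file_name": "a.exe", "md5": "md5-placeholder-1"},
    {"src_ip": "192.0.2.11", "dest_ip": "198.51.100.5", "protocol": "tcp",
     "file_name": "b.exe", "md5": "md5-placeholder-1"},
    {"src_ip": "192.0.2.10", "dest_ip": "198.51.100.7", "protocol": "udp",
     "file_name": "c.dll", "md5": "md5-placeholder-2"},
]

def top_values(events, field, n=3):
    """Pivot on one field: return the n most common values with their counts."""
    return Counter(event[field] for event in events).most_common(n)

print(top_values(events, "md5", 1))
print(top_values(events, "src_ip", 1))
```

The same function answers different questions depending on the pivot field: which sample is most widespread, which source is most active, which protocol dominates.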

    This interactive geo-view provides the team with botnet detection, malware presence, Honeynet traffic flows and an instant status report, all from one location.

    Yong Sweah Liang (Linus), VP, Head of Infrastructure and Technology for Infocomm Asia Holdings Pte Ltd (IAHGames) was our second customer speaker.

    IAH is an online game company operating some major properties including:

    • EA SPORTS™ FIFA Online 2
    • Granado Espada
    • Dragonica
    • Distribution of Box products
    • BioShock®
    • Grand Theft Auto IV



    Splunk Live Washington DC 2009

    Obama-nomics is highly visible in our nation’s capital these days. The DC economy is humming as our tax dollars are hard at work fueling all kinds of government spending. With more than 100 attendees at Splunk Live on Thursday, we certainly were not disappointed in our quest to help make all this growth in government more efficient! Managing large networks and security forensics were the hot topics of conversation at Splunk Live Washington, DC, where everyone was treated to a trio of incredible speakers.

    Our first speaker was Andy Purdy, Co-Director of the International Cyber Center at George Mason University and former Acting Director of the National Cyber Security Division (NCSD) and US-CERT at the Department of Homeland Security. Andy was a member of the White House staff team that drafted the U.S. National Strategy to Secure Cyberspace (2003) and served on the DHS tiger team that formed the National Cyber Security Division (NCSD). He spent 3 1/2 years at DHS, the last two heading the NCSD and US-CERT as the “Cyber Czar” of the U.S. Andy is also a Special Government Employee on the Defense Science Board Task Force on Mission Impact of Foreign Influence on DoD Software, and a partner with the law firm of Allenbaugh Samini Gosheh, LLP.

    The Constantly Changing Threat Landscape

    Andy talked with us about the changing threat landscape and lessons learned from past approaches to cyber security that can be applied in a forward looking approach to Risk Management and Compliance.

    Since much of his experience has been spent preparing the country for what cyber threats are coming next, Andy thinks of IT security as a war fought in a constantly morphing theater with new technologies and vulnerabilities and new motivations and threats.

    A Different Approach Moving Forward

    For anyone serious about security this is a sound perspective, whether you are a government agency, a major enterprise or a small business. But the balance between open networks and services and robust security remains one of the major challenges for IT organizations. Andy pointed us to lessons learned from his past, fueling a vibrant conversation during the customer and speaker roundtable. Perhaps the most important thing I heard was that it’s not enough to prepare for the last war, or the last successful attack. While perimeter defense and legacy standards for network security provide some measure of security, those measures are very often insufficient to deal with the new threats that seem to be gaining in sophistication at an accelerating pace. Andy encouraged us to focus on adopting new requirements and security infrastructure for situational awareness and control.

    Greater sophistication, slower, lower-level attacks and greater knowledge about the targets (data, activity, vulnerabilities) are all contributing to the need for near-real-time visibility on a large scale. This has become far more important than sub-second correlation of known attack vectors against discrete sets of network devices.

    “NIST perspective: Continuing serious cyber attacks on federal information systems, large and small; targeting key federal operations and assets. Attacks are organized, disciplined, aggressive, and well resourced; many are extremely sophisticated. Adversaries are nation states, terrorist groups, criminals, hackers, and individuals or groups with intentions of compromising federal information systems.”

    Andy went on to discuss how the effective deployment of malicious software, causing significant exfiltration of sensitive information (including intellectual property) and potential disruption of critical information systems and services, has made detection of information and data leakage a key government and enterprise security requirement.

    Bob Flores, former CTO and 31-year veteran of the CIA, was our next speaker. Bob retired from the CIA six months ago and is now President and CEO of Applicology, providing cyber security and IT strategy consulting services. In his 31 years at the CIA, he held various positions in the Directorate of Intelligence, Directorate of Support, and the National Clandestine Service. Most recently he was the CIA’s CTO, where he was responsible for ensuring that the Agency’s technology investments matched the needs of its many missions. Bob has Bachelor and Master of Science degrees in Statistics from Virginia Tech.

    Quis custodiet ipsos custodes?

    Brush up on your Latin! “Who’s guarding the guards?” was the topic of Bob’s talk. Insider threat in an ever-changing threat landscape was and remains our number one cyber security risk.

    “Defense-in-depth isn’t just about putting adequate technology in place, it’s also about paying attention to your people and implementing policies and procedures to reduce the likelihood of an insider attack.”
    - Dawn Cappelli, CERT

    The simple but not so obvious model Bob pursued at the CIA was an extension of the ISO/OSI stack with non-technical, motivational additions.


    We need to worry about all levels of the stack, including layers eight and nine, because we all have people messing around at various layers with applications, scripts, communications, etc. And their motivation is often very clear.

    “Nemo repente fuit turpissimus!” Or, “No one ever became thoroughly bad in one step!”

    The point is people don’t just wake up one day and decide to be bad. They are motivated over time by larger causes and in EVERY CASE leave a trail of clues behind that can’t entirely be covered up.

    What to Do?

    According to Mr. Flores the focus needs to be on real-time visibility. You need visibility into who (or what) is perturbing your enterprise right now and over time. You can tediously review the logs of each device and user as the CIA used to do or you can take advantage of Splunk.

    “Splunk may not be the best thing since sliced bread, but it’s pretty darn close.”
    - Bob Flores

    Why Splunk?

    Why did the CIA choose Splunk over so many other security forensic solutions? It all comes down to how easily and scalably Splunk can eat any logs, events and messages Bob’s organization throws at it. Combine that with the real-time search, alert and reporting and over time statistics and analysis on

    Splunk Live Princeton 2009

    It’s Wednesday and we’re at Splunk Live Princeton, NJ. What an awesome place. Princeton is home to a great university and some great culinary experiences. Check out Mediterra, an interesting mix of Italian and Spanish influences. Apparently it’s where all the Princeton parents treat their kids to dinner when they are in town. Next door to our venue was the great hope for the state of NJ: a new Governor. The current Governor has turned the state budget and tax base into toxic waste. Well, things went much better for the more than 60 Splunk Live attendees in Princeton today, who gained insight into how a number of large Splunk customers keep their mission critical applications running in a time of IT budget slash and burn.

    Matthew Stevens, Director of Software Systems and Architecture at Comcast, provides guidance to Comcast executives on mission critical media systems and strategic systems architecture. Comcast is the country’s largest provider of cable services, serving 23.9 million cable customers, 15.3 million high-speed Internet customers and 7.0 million Comcast Digital Voice customers.

    Comcast Developer Network

    Matthew’s latest project is the Comcast Developers Network, a Comcast-scale secure web services platform for the development of cool new media and entertainment offerings. The Comcast Web Platform environment generates billions of software events each day from caching and load-balancing, origin application servers, databases, middleware and content delivery networks for images and video streams. Comcast services demand high quality. Much of the Comcast content is exclusive and premium services drive revenue. Interfaces between technology components (applications, delivery platforms) need to adhere to best practices to ensure the highest degree of end customer experience.

    Why Splunk?

    Comcast has acquired many system and application management platforms over the years, but none was providing the robust operational telemetry that teams around the company need to ensure data integrity, stability, application quality and efficiency. Several efforts specifically drove Comcast to consider and deploy Splunk.

    • Product rollout: The team wanted the ability to predict and correct potential issues before going live into production. Splunk has become a required best practice for new product rollouts.
    • Network/System Integrity: Understanding security and user experience across a very large network and set of systems is a must to protect the business. Splunk provides the insight the network and system teams need across many different silos of technologies.
    • Business Intelligence: Having immediate access to real-time events and historical trends allows the various Comcast business teams to react quickly and adapt to changing customer behaviors.
    • Agility: Alerts and Dashboards indicate discrepancies so distributed teams can investigate immediately and remediate failures and attacks.

    Video CDN/CMS Performance

    “In content management systems and delivery networks a devil walks the long tail. If you’re facing concurrent hits across the tail of the curve, sharpen your pencil, you’ve got problems!”

    Splunk helps Comcast understand the risks of instability in its systems, especially during periods of high concurrency. Through pre-production modeling of event patterns and subsequent monitoring of these patterns, Splunk pays for itself by helping Comcast avoid deployment of vulnerable systems, downtime, and upset customers.

    Predicting System Imbalance

    Comcast has successfully used Splunk to evaluate potential infrastructure vendors’ solutions and determine whether they will balance loads properly across a large, indeterminate infrastructure. Often the answer is no, as illustrated here in a Splunk report of resource utilization across various services.

    Splunk has also been utilized to see whether solutions will be resilient to different traffic patterns, helping the company perform predictive analysis before making critical infrastructure investments.

    Load testing is performed during non-peak hours and the results are analyzed for system failures over time using the telemetry data Splunk can correlate across various logs, messages and events.
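Analyzing failures over time boils down to bucketing events into time windows and counting the bad ones. The Python sketch below shows that idea under some loud assumptions: events are (timestamp, HTTP status) pairs, a failure is any status of 500 or above, and the window is one minute. None of this reflects Comcast's actual telemetry or reports.

```python
from collections import Counter
from datetime import datetime

def failures_per_minute(events):
    """Count failing events (status >= 500, an assumed definition) per minute."""
    counts = Counter()
    for timestamp, status in events:
        if status >= 500:
            # Truncate the timestamp to the start of its minute.
            minute = datetime.fromisoformat(timestamp).replace(second=0, microsecond=0)
            counts[minute.isoformat()] += 1
    return dict(counts)

# Hypothetical load test events: (ISO timestamp, HTTP status).
load_test_events = [
    ("2009-10-01T02:00:05", 200),
    ("2009-10-01T02:00:40", 503),
    ("2009-10-01T02:01:10", 500),
    ("2009-10-01T02:01:30", 502),
]

failure_counts = failures_per_minute(load_test_events)
print(failure_counts)
```

A rising count from one window to the next during a steady load is the kind of pattern worth digging into before a system goes live.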

    When failures are found the Comcast team uses Splunk reports to dig deeper into the data.


    Security and Compliance

    In addition to operations use cases, Comcast security and compliance teams leverage the consolidated logs across data centers to enable faster threat assessment and security monitoring.

    • Monitoring for bad actors to trigger alerts,
    • Conducting threat detection over time,
    • Detecting attacks/vulnerabilities in systems and
    • Auditing systems in support of security assessments and compliance.

    What’s Next?

    Next up for Matthew and team is the launch of the Comcast CodeBig Platform, enabling a network of developers to create content for the network. Some of these developers are already using Splunk in their own managed services like Mashery. Comcast is working to hook the Mashery Splunk installation to their own in order to provide visibility across multiple services and providers of content and entertainment functionality.

    Chris Abboud manages the Enterprise Systems Management team at Dow Jones, monitoring customer-facing infrastructure and applications. Dow Jones provides global business news and information services to millions of consumers and enterprise media groups. Keeping these revenue-generating services running 24x7x365 is the highest priority. Chris also manages the DJ service management platforms (Remedy, Knowledge Base, etc.). He’s been with the DJ organization for 10 years and in his current role for 3 years.

    “Our mission is to address issues before they become service impacting events. Failures are going to happen — we need to make sure people know about them as soon as possible.”

    The Splunk Set-up

    The Dow Jones Splunk installation includes

    • Data from 6000+ servers globally,
    • 13,500 + source types,
    • 1,700 network devices (primarily Cisco and Juniper) and
    • Ten distributed Splunk servers in different geographies that index ~100GB a day and provide a new global logging console.

    Why Splunk?

    Each Dow Jones command center now has the ability to know what’s happening before customers do across a wide range of internal and external services. Splunk speeds the time to resolution for email outages that may impact internal users’ productivity and for editorial site downtime that can directly impact customer service and revenue. Dow Jones has found Splunk generates significantly fewer false positives than traditional monitoring systems and new resources are much easier to manage and deploy.

    Splunk Live New York 2009

    This week we’re on the East Coast enjoying some fantastic customer presentations and roundtables at Splunk Live events in New York City, Princeton NJ and Washington DC. It’s Tuesday and we have more than 100 customers and Splunk users attending Splunk Live in midtown Manhattan. The vibe is electric as we’re being treated to awesome talks by IDT and New York Life. At lunch, long-term customers Bloomberg and AT&T joined the customer roundtable conversation.

    Gabe Arnett, Senior Software Architect at Moody’s, demonstrated how Splunk is being used to monitor and troubleshoot the Moody’s Analytics platform. Gabe has more than 15 years of experience building web applications in financial services, investment banking and e-commerce. At Moody’s he’s responsible for the global development team that develops and supports the newly redesigned client-facing website, v3.moodys.com. Moody’s is a leading provider of research, data, analytic tools and related services to debt capital markets and credit risk management professionals. The company’s products and services provide the means to assess and manage the credit risk of individual exposures as well as portfolios; price and value holdings of debt instruments; analyze macroeconomic trends; and enhance customers’ risk management skills and practices.

    Moody’s Splunk environment is utilized by 25 different users and runs on Windows 2003. Splunk provides Gabe’s developers with secure access to the logs they need without touching the production devices, servers and applications. His team has built custom searches and a number of dashboards indicating the general health of their applications and service. Custom searches and alerts track errors and access – guaranteeing a good user experience. The team also uses Splunk to understand when and where new content isn’t flowing to the v3 platform. A large part of the Moody’s user experience is delivering email alerts and Splunk helps the team track GUIDs to ensure customers receive the alerts they’ve subscribed to.

    The team recently migrated from Splunk 3 to Splunk 4 – taking 30 minutes to perform the upgrade. The Splunk for Windows App has been significantly revamped in Splunk 4 and the Moody’s team is making use of it to monitor through WMI local server resources (disk, memory, networking) and correlate this performance data with the Windows and Application event logs.

    Shay Benjamin, CSO and SVP, Architecture at IDT, designs and implements network architectures and manages compliance, security and fraud initiatives. IDT Corporation (www.idt.net) is a holding company focused on the telecommunications and energy industries. Since 1995 they’ve been building hundreds of VOIP switches globally and assembling an international fiber optic network. IDT pioneered VOIP (Voice over Internet Protocol) to create Net2Phone, piloted the first commercial WiFi phone service in the US and has created a prepaid calling card business, which sells 12 million calling cards a month.

    IDT uses Splunk primarily for VOIP Call Detail Records (CDRs). The company indexes more than 120 million CDRs per day with six mirrored Splunk server instances. CDRs are somewhat like logs, but with many fixed, delimited fields. One or more CDRs are created at each switching or routing point for every VOIP call. CDRs vary between platform devices in number of fields and contents and, unlike logs, few CDR fields contain easy-to-read key=value pairs. Although CDRs are a key part of maintaining service quality, billing, monitoring network quality and security forensics, working with them is labor intensive, and delay wastes labor, time and money.
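    To make the difference between a delimited CDR and a key=value log line concrete, here is a minimal Python sketch. The two field layouts, the platform names and the sample record are hypothetical and invented for illustration; real CDR layouts vary by switch vendor and are not described in the talk.

```python
import csv
import io

# Hypothetical field layouts for two switch platforms (assumptions, not IDT's real formats).
LAYOUTS = {
    "platform_a": ["start_time", "calling_number", "called_number", "duration_sec", "trunk_group"],
    "platform_b": ["calling_number", "called_number", "src_ip", "dst_ip", "start_time", "duration_sec"],
}


def parse_cdr(line, platform, delimiter=","):
    """Map one delimited CDR line onto the field names defined for its platform."""
    names = LAYOUTS[platform]
    values = next(csv.reader(io.StringIO(line), delimiter=delimiter))
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} fields, got {len(values)}")
    return dict(zip(names, (value.strip() for value in values)))


def to_key_value(record):
    """Render a parsed record as a readable key=value string."""
    return " ".join(f"{key}={value}" for key, value in record.items())


if __name__ == "__main__":
    # Made-up sample record using fictional 555 numbers.
    sample = "2009-09-15T10:00:00,12125550100,442075550199,183,NYC-TG-01"
    print(to_key_value(parse_cdr(sample, "platform_a")))
```

    The point of the sketch is only that a positional record is unreadable until someone supplies the field names, and that each platform needs its own layout.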

    IDT needs fast searches across all fields of the CDRs and quick data loading – to allow fast retrieval of call data and cross platform searches to unify results from different CDR formats. Historically IDT utilized a custom RDBMS solution with an application called Call Genius. In their RDBMS IDT was forced to limit the fields that get indexed because indexing of CDRs with an RDBMS is costly as it takes up a lot of space and slows load times. The RDBMS also only indexes fields common to multiple platforms’ CDRs. In the RDBMS solution much of the CDR data was put into BLOBs (actually CLOBs) – multiple CDR fields mapped into a single RDBMS field to try and achieve efficiency. But BLOBs can be very difficult to search and are difficult to index effectively. The legacy Call Genius application didn’t permit the search of CDR BLOBs.

    Now IDT utilizes Splunk to index all CDR fields. No need to decide what fields to index and cross platform searches are easy without losing specific platform CDR format resolution. There is no longer a need to create BLOBs for efficiency. Engineers and support staff are able to quickly search for any combination of

    • Phone Number
    • IP address
    • Trunk Group Name

    Splunk naturally and easily links search terms across fields and the users just need to enter the phone number or IP and get back the CDR events and transactions.
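    The idea of entering a phone number or IP and matching it against any field, regardless of CDR format, can be sketched as follows. This is a plain linear scan over already-parsed records, shown only to illustrate field-agnostic matching; it is not how Splunk indexes or searches data, and the record layouts and values are hypothetical.

```python
# Hypothetical parsed CDRs from two platforms with different field sets.
SAMPLE_RECORDS = [
    {"platform": "a", "calling_number": "12125550100", "called_number": "442075550199",
     "trunk_group": "NYC-TG-01"},
    {"platform": "b", "calling_number": "12125550100", "called_number": "81355550123",
     "src_ip": "192.0.2.10", "dst_ip": "198.51.100.7"},
    {"platform": "b", "calling_number": "13105550111", "called_number": "442075550199",
     "src_ip": "192.0.2.44", "dst_ip": "198.51.100.7"},
]


def search(records, *terms):
    """Return records in which every term exactly matches the value of some field."""
    hits = []
    for record in records:
        values = set(record.values())
        if all(term in values for term in terms):
            hits.append(record)
    return hits


if __name__ == "__main__":
    # One term matches records from both platforms; adding an IP narrows the result.
    for hit in search(SAMPLE_RECORDS, "12125550100"):
        print(hit)
    for hit in search(SAMPLE_RECORDS, "12125550100", "192.0.2.10"):
        print(hit)
```

    Because the search never names a field, the same query works across formats that do not share a schema, which is the property the RDBMS approach lacked.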

    Comparing Splunk to the RDBMS solution, IDT found Splunk searches to be 50 to 100x faster than searches on non-indexed RDBMS data. Searches on indexed fields are also faster in Splunk than in the previous RDBMS solution. Splunk load times for a typical sample average 1 to 5 minutes versus 20 to 40 minutes for the RDBMS.

    IDT is in the process of feeding firewall, security, router, IP network, and switch data into Splunk as well. They’re already discovering that Splunk finds errors not captured by Network Management Consoles, and it has provided valuable troubleshooting during recent datacenter migrations.

    Most of all IDT is looking forward to discovering new ways to use all the data in Splunk. Heuristic analysis and Business intelligence applications are on the top of their list including the use of Splunk to find human “Family and Friends” networks and drive the development of new commercial programs.

    Splunk 4 Lands in the Southwest

    Last week we continued our road show launching Splunk 4 through the Southwestern US in Phoenix, San Diego and Los Angeles. This was our second annual gathering of customers, partners and users and we had more than double the attendees at this year’s Splunk Live events. In the morning we held a three-hour hands-on technical workshop. Attendees had the opportunity to install and configure Splunk 4 on their laptops or remote server and get one-on-one assistance from the Splunk team. Afternoon sessions and dinner focused on customer presentations. We’re very grateful to all the presenters who took time out of their busy days to share with everyone how Splunk is transforming their IT environments. I captured some notes from the week and thought I’d share them with you.

    Early Warning

    In Phoenix we had a packed house at the Sanctuary conference center on the side of Camelback Mountain. At 109 degrees I decided against hiking up it in the early AM. Dave Bridgeman, Data Security Engineer at Early Warning, kept things cool showing the audience how his company uses Splunk in their security operations center. Early Warning collaborates with major financial services companies to facilitate fraud detection through shared information and knowledge in cross-institution environments. The company has an interesting history having spun out of First Data and is now primarily owned by Bank of America, BB&T, JPMorgan Chase and Wells Fargo.

    Dave is a well-rounded IT professional who started as a developer then moved into network and security management. He currently leads the data security team for Early Warning. The environment he oversees includes a variety of platforms including AS400s, MP300s, AIX, Solaris, Linux and Windows. He uses a combination of Splunk forwarders and syslog forwarders to collect Java and Cobol application logs and FTP/SFTP networking logs.

    The Early Warning Splunk installation is designed to track transactions and users from one bank to the next in cross-institution activities. Transaction ID tracing correlates events across applications and services and Splunk alerts the team when jobs fail so the operations and development teams can securely troubleshoot issues on the fly. And remote accessibility means no more driving into the office to access locked-down servers in the middle of the night. On the security side of things Splunk helps Dave’s team track and monitor known fraudsters and bad user names, allowing them to stay vigilant when monitoring external attacks. They also use Splunk to deliver reports for customers, executive committee members and the Security Advisory Committee (with representatives from the founding banks).
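    Transaction ID tracing of the kind described above can be sketched in a few lines of Python. The event fields (`txn_id`, `time`, `source`, `status`) and the sample events are assumptions made for illustration; the talk did not describe Early Warning's actual log formats or searches.

```python
from collections import defaultdict


def trace_transactions(events):
    """Group events from all sources by transaction ID, ordered by timestamp."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["txn_id"]].append(event)
    for txn_events in grouped.values():
        txn_events.sort(key=lambda event: event["time"])
    return dict(grouped)


def failed_transactions(traces):
    """Return the IDs of transactions containing at least one FAILED event."""
    return sorted(
        txn_id
        for txn_id, txn_events in traces.items()
        if any(event["status"] == "FAILED" for event in txn_events)
    )


# Hypothetical events from three different sources, deliberately out of order.
SAMPLE_EVENTS = [
    {"txn_id": "T-1002", "time": 3, "source": "sftp", "status": "FAILED"},
    {"txn_id": "T-1001", "time": 2, "source": "cobol_batch", "status": "OK"},
    {"txn_id": "T-1001", "time": 1, "source": "java_app", "status": "OK"},
    {"txn_id": "T-1002", "time": 1, "source": "java_app", "status": "OK"},
]

if __name__ == "__main__":
    traces = trace_transactions(SAMPLE_EVENTS)
    for txn_id in failed_transactions(traces):
        path = " -> ".join(event["source"] for event in traces[txn_id])
        print(f"ALERT {txn_id}: {path}")
```

    Grouping by a shared ID is what lets a failure in one system be read alongside the steps that preceded it in other systems.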

    Amkor

    Henry Grant of Amkor, a $2.1B provider of packaging/assembly and testing services for the semiconductor industry, also presented an overview of how his Corporate Data Center team uses Splunk. Henry oversees operations for the company’s SAP, PLM, Supply Chain, Hyperion and Oracle systems. Amkor has a heterogeneous environment of Sun Solaris, IBM iSeries, Cisco ASA firewalls, packaged and custom web and J2EE applications and TACACS/RADIUS accounting and access control technologies. With manufacturing locations in China, Japan, Korea, Taiwan, Singapore and the Philippines and headquarters in Chandler, AZ, the Amkor team is challenged with log and event data overload. GBs of data a day generated at multiple points make operational troubleshooting and security investigations extremely complex.

    SOX Compliance

    Proving SOX compliance has traditionally been handled by writing and maintaining scripts to collect and report on errors, access controls and log access activities. It was impossible to segregate duties given the lack of access control to the logs and events themselves. Splunk has taken the place of the awkward script writing and maintenance to collect iSeries, Unix and application events and logs and provide automated scheduled reports. The team is now expanding the Splunk footprint to handle network and Oracle logs as well.

    Application and System Monitoring

    Like most enterprise IT shops, Amkor has figured out that traditional point monitoring tools aren’t enough as they have a hard time scaling to all the modern day technologies, require intrusive agents and only work for known events but don’t handle anomalies and unknowns. Too many issues end up being reported by end users themselves rather than the monitoring systems. With Splunk Henry’s team detects event anomalies in real time and has dramatically cut their response time by hours per incident.

    Tools for the Help Desk

    Sometimes it’s the simple things that can cut your response time, escalations and IT budget. The Amkor team noticed a lot of calls and emails regarding VPN set-up and access across the company. With Splunk level 1 help desk agents are now able to resolve most of the VPN issues without creating an escalation. Henry’s team built a VPN dashboard driven by a series of searches and reports that gives entry level help desk personnel the insight they need to troubleshoot problems right away.

    Henry’s Splunk Tips

    The best part of Henry’s overview was the tips for a successful Splunk implementation. I’ve included the list here in hopes that these may help you as well.

    • Provide training that caters to each group’s need.
    • Utilize the deployment Server.
    • Develop a Common Information Model.
    • Update and change as needed.
    • Use Tagging to Normalize Data.
    • Monitor Scheduled Compliance Reports by using the Audit Logs.
    • Splunk into your processes where possible.
    • Set up a Test/Dev Environment and a Test/Dev Index.
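    As a rough illustration of what the tips about a common information model and tagging are getting at, the sketch below renames source-specific fields to shared names and attaches tags per source type. It is generic Python, not Splunk configuration, and every source type, field name and tag in it is hypothetical.

```python
# Hypothetical mapping from source-specific field names to common field names.
FIELD_ALIASES = {
    "cisco_asa": {"src": "src_ip", "dst": "dest_ip"},
    "radius": {"Calling-Station-Id": "src_ip", "User-Name": "user"},
}

# Hypothetical tags applied to every event of a given source type.
TAGS = {
    "cisco_asa": ["firewall", "network"],
    "radius": ["authentication"],
}


def normalize(event, sourcetype):
    """Rename known fields to their common names and attach the source type's tags."""
    aliases = FIELD_ALIASES.get(sourcetype, {})
    normalized = {aliases.get(key, key): value for key, value in event.items()}
    normalized["tags"] = list(TAGS.get(sourcetype, []))
    return normalized


if __name__ == "__main__":
    print(normalize({"src": "192.0.2.10", "dst": "198.51.100.7"}, "cisco_asa"))
    print(normalize({"Calling-Station-Id": "192.0.2.10", "User-Name": "jdoe"}, "radius"))
```

    Once both sources expose `src_ip`, a single search or report can cover firewall and authentication events without knowing each vendor's field names.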

    Intuit Consumer Group

    The Intuit team of Jeff Ludwig, Chief Architect, and Larry Raab, Architect of the Consumer Group, joined us to share how they use Splunk in production support operations. Jeff leads the Consumer Group’s Connected Services Development for electronic and print tax and payroll filings for TurboTax, ProSeries, Lacerte and QuickBooks. Larry is a large-scale, highly available application and systems architect responsible for the consumer group applications and infrastructure.

    While the original use for Splunk at Intuit was application management, Jeff and Larry covered three additional ways they have applied Splunk including reliable monitoring, improving user experience and large-scale reporting for compliance and business intelligence.

    Splunk Live London - Awesome

    I’m finally getting my head above water after a tireless run up to and hectic week launching Splunk 4. The highlight of the launch for me was Splunk Live London. IMHO Splunk Live London 2009 was unrivaled as the most outstanding Splunk event yet.
    We came up with this idea of getting local customers together as a way to launch Splunk 2 in June 2007. Five of us Splunkers sprinted between eight different cities in two weeks to share what was new and encourage users to exchange stories of how searching their data centers was changing life for the better. It’s an exhausting way to launch a new product, but it worked so well we’ve integrated Splunk Live events into the mainstream way we do business and interact with our community. I’ve long since lost count of the number of Splunk Lives we’ve conducted all over the world including places like Cape Town, Johannesburg, Beijing, Tokyo, Singapore, Bangkok, Sao Paulo and yes once again in London.



    This year’s London Splunk Live was really special. The event occurred during our launch of Splunk 4 and surpassed our expectations as the largest event we’ve ever held. More than 100 customers and users attended at the Cumberland Hotel and their swank conference facility, complete with a business canteen like breakfast experience, near Marble Arch in West London.

    But the dominant reason to attend any Splunk Live is the presentations and round tables with forward-thinking IT professionals who are using Splunk to transform the way they manage IT. This year we were very fortunate to have three Splunk customers who took time out of their busy schedules to come to London and share their experiences with us.

    Accenture - Alexander Strobl, Technical Consultant

    Alexander has been a visionary inside Accenture bringing the power of IT Search to enterprise clients in Germany where he works for Accenture as a Technical Consultant in the Data Center Technology and Operations team. Alexander is responsible for the analysis, design and rollout of Splunk. His most recent Splunk project was with a large worldwide services company with more than 50,000 employees on three continents operating mail order, distribution, e-commerce and over-the-counter retail trade. Accenture implemented Splunk to transform the management of several technologies including Linux, virtualization and large-scale storage systems.

    The project was part of an IT project to reduce the time to triage problems and improve quality of service. Challenges were:

    • no centralized access to logs and events,
    • critical IT data was stored on local file systems which were copied to central storage only once a day,
    • manual processes to locate errors,
    • no correlation between events on different services/servers and
    • development time was spent building workarounds rather than working on revenue generating applications.

    All of this resulted in complex and time-consuming analysis and, in the end, long MTTR.

    The Accenture Splunk installation is currently indexing ~50GB/day including custom application files and events from 10+ integrated business critical applications and services. There are two Splunk indexes; one for testing and one for production environments and the team has established interfaces between Splunk and several other legacy data center tools.

    Telenor - Henrik Strøm, Security Architect

    Telenor is Norway’s largest ISP, Mobile Operator and Telco. It’s one of the largest mobile operators in the world, with 160+ million customers and was founded in 1855 - 154 years ago. The company has 13,000 employees in Norway and 26,000 abroad. Telenor has been rolling Splunk out for centralized log collection and management using Syslog to forward data where it is already in place and using Splunk as a forwarder for new systems and systems with complex multi-line and/or XML structures Syslog can’t handle. Sources of data handled by Splunk include:

    • application logs (Web, Email, IPTV)
    • data center logs (server, network, storage and firewall)
    • IP backbone logs

    Use cases include what Henrik refers to as digging, dashboards, baselines, alerting and reporting. One of the best “digging” examples Henrik mentioned was identifying Unix Kernel Errors over the last 30 days. This kind of information routinely went unnoticed prior to Splunk’s arrival.

    Another powerful use case explained by Henrik was how to baseline what is normal in your environment. For example, how many errors do you have on average for a particular type of device (routers, servers, specific applications, etc.)? Splunk was used to baseline normal Linux kernel behavior and found roughly 20 kernel errors per running Linux instance every 15 minutes.

    The baseline then allows the team to schedule simple searches to look for deviation from the baseline and send out alerts before downtime occurs from these hidden sways in behavior. In one case Splunk found thousands of errors occurring on a specific type of device, where the normal baseline was around 20!
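    The baseline-then-alert pattern Henrik described can be sketched as follows. The historical counts, host names and the 3x threshold are assumptions chosen for illustration (the talk only mentioned a baseline of roughly 20 errors per 15 minutes); a real deployment would pick its own window and threshold.

```python
from statistics import mean


def baseline(historical_counts):
    """Average error count per window over a historical period."""
    return mean(historical_counts)


def deviations(current_counts, base, factor=3.0):
    """Return hosts whose current error count exceeds the baseline by the given factor."""
    return {
        host: count
        for host, count in current_counts.items()
        if count > base * factor
    }


# Hypothetical kernel-error counts per 15-minute window.
HISTORY = [18, 22, 19, 21, 20]
CURRENT = {"linux-01": 23, "linux-02": 17, "linux-03": 4100}

if __name__ == "__main__":
    base = baseline(HISTORY)
    for host, count in deviations(CURRENT, base).items():
        print(f"ALERT {host}: {count} errors in window, baseline {base}")
```

    A fixed multiple of the mean is the simplest possible rule; it ignores variance and time-of-day effects, which a production alert would likely need to account for.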

    The Telenor team also uses Splunk to identify and report on security situations that may impact their customer facing network and services. Because they are able to easily compose dashboards showing for example which Web servers are under attack and who is attacking them all in one place, the team saves Telenor from potential downtime, performance degradation or theft of data due to attacks they’ve not seen before and are missed by existing security policies and technologies.

    Vodafone - Paulo de Carvalho, Network Services Manager

    Paulo de Carvalho has been using Splunk at Vodafone for almost two years now. His presentation titled “Freeing Information from Organizational Silos” lifted the idea of leveraging logs and IT data out of the realm of just system administration into a thirst for higher level intelligence that crosses not only IT but also business functions. Paulo started by describing the current service oriented architecture (SOA) at Vodafone and how attempts to objectize and re-use capabilities create incredible complexity among the services, technologies, processes, tools and people.

    Splunk Live San Francisco. It’s about time.

    Last night we hosted more than 100 people at our first ever Splunk Live in San Francisco. It was about time. In May 2007 we started our first series of Splunk Live events. We’ve traveled all around the world from Santa Clara, Los Angeles, Phoenix, San Diego, Dallas, Chicago, New York, Washington DC, Atlanta, London, Zurich, Singapore, Taipei, Shanghai, Beijing, Bangkok and Hong Kong. But never have we had an event in our own backyard. Congratulations to Steve Sommer and our Marketing Team for pulling it off.

    The event took place in our new offices at 2nd and Brannan Street.

    It’s a little known fact that for the first two years at Splunk we never actually had an office of our own but squatted in the offices of venture capitalists and other start-up companies like Six Apart. Having a conference room called “BIG” where we can actually fit more than 100 people still takes some getting used to.

    The best part of every Splunk Live, of course, is the customer presentations. Last night we were honored to have three local customers show everyone how they are using IT Search.

    • Mashery, the leading provider of API management services enabling companies to easily leverage web services as a distribution channel, discussed how they use Splunk to power self-service reporting for their customers on activity within their hosted, cloud-based services.
    • Lawrence Livermore National Labs (LLNL), a US Dept of Energy national lab, talked about their Splunk deployments in multiple groups and data centers addressing a wide range of needs, from application availability to meeting FISMA security regulations. They drive a range of initiatives from high performance computing to nuclear weapons development to running particle accelerators.
    • Visa International, the world’s largest retail electronic payments network and one of the most recognized global financial services brands, shared how they use Splunk for network security monitoring and incident response.

    Stay tuned to our events page for more upcoming Splunk Live events next year. We plan to visit several cities each quarter and will likely be in your neighborhood at some point in the near future.





    Human and Machine Language Mashups at Splunk Live Zurich, Switzerland

    At Splunk Live in Zurich this week an interesting discussion erupted about human and machine languages. Before I continue with the story, I want to thank everyone that attended the event. Despite the fact that Raffy Marty is a resident celebrity, this was our first formal customer and partner event in Switzerland. We had more than 50 people attend for several hours to talk about Splunk and data center management challenges. The event was co-hosted by T-Systems.

    Thank you Meno Schnapauff for your great presentation on how T-Systems and the Swiss National Railway are using Splunk!

    Other attendees included folks from Swisscom, Unicom Consulting, Rothschild Bank, Genossenschaft Migros, LeShop, Netcetera, Cablecom GmbH, TBK-Patent Munich, On Line Video 46, Skyguide, PostFinance and the University of Fribourg. Brian Haynes, Tim Thorpe, Julie Duncan and Hash Basu-Choudhuri from our London office participated too.

    Now part of the reason I mention all these names (in addition to thanking folks) gets to the point of this post. In the room we had an American (me), several native English speakers from different areas of England, Swiss German speakers from Switzerland and German speakers from Germany. What I noticed is how two people think they speak the same language but can’t always understand each other. It turns out there are a lot of American (some West Coast) colloquialisms I use that my “Queen’s English” counterparts don’t understand. And of course most of the time I try to make a joke the Swiss and Germans just look at me like I’m from outer space even though if you asked them they’d say they speak fluent English. During the event the Swiss Germans had trouble understanding the Germans and the Germans had trouble understanding the Swiss Germans. The folks from the UK who spoke German didn’t understand either the Swiss German or the German German although they all claim to speak German.

    What does all this have to do with IT you ask? Well it turns out that mashing up languages and attempting to understand each other even though we don’t speak exactly the same language is one of the biggest problems we have in trying to understand our IT systems as well.

    “One of the questions posed at the event was how can I modify my system and application logging to some standard in order to follow what my systems are doing? Do we need a logging standard?”

    I have long been telling people that logging standards are a waste of time. IBM’s Common Base Events (CBE) has been around for decades and has very little traction in the real world. Data Center Mark-up Language (DCML) was pushed by Opsware and lots of smart people. It got nowhere. Logs exist. Instrumentation exists. Our IT systems already have tremendous amounts of data. Trying to retrofit that data to some standard is impossible. Attempting to organize a multi-vendor logging standard will never happen. Getting developers to log consistently sounds great but I’ve never seen it done before.

    What we need is a mashup of machine languages and logging formats. That’s exactly what IT Search is!

    Humans need to stop thinking about how we can format data to make it easier for machines to work with it. There is too much data. The real value is being able to work with massive amounts of data without any human intervention. This is exactly what Google does for the web. Sure you can reformat your HTML to get better search results. But even if you do nothing Google will index your site. You don’t even have to tell Google to do it!

    I’m going to start sharing more of our experiences helping people see the connections that already exist in their logging data. While the connections are not always obvious to the naked eye and human linear thinking, machines are great at teasing out non-obvious relationships. This is perhaps the most compelling thing we work on at Splunk and continue to push the bleeding edge of what’s possible.

    The Splunk Platform Has Launched

    Without a doubt the past week has been the most amazing week in Splunk history. The crazy coast to coast multi-city launch left us all exhausted and electrified. A few of the things that stick in my mind…

    First Splunk 3.2 including Splunk for Windows went live on our download page last Saturday and more than 40% of our downloads in the past week have been for our new Windows version. Then Nick Selby of 451 Group wrote an analyst brief on us. He said, “Splunk is awesome: it’s multiplatform, easy to install and easy to use. And with an abstraction layer of logs, configuration files and system messages, traps and alerts, it’s seriously useful.” 451 has a reputation for ripping vendors, so we’re flattered.

    Dana Gardner, analyst with Interarbor wrote a very eloquent analysis of our platform launch on ZD Net. “Splunk has created the means to offer developers easy access to that data and the powerful inferences gleaned from comprehensive IT search. That means the data can go places no log file has gone before,” says Dana. Developers are certainly doing some way cool things with Splunk.

    I’ve seen a couple of neat visualization applications including this one called Replay. It shows you a live or time lapsed view of your event streams. Here you can see the replay application hooked up to our internal wiki showing who’s doing what over a 24 hour period. Click on the image for the movie.

    replay.png

    As for our own applications, the Splunk for PCI app drew tremendous interest at our series of Splunk Live events this past week. It’s just one example of how a business person with domain knowledge can package their own Splunk configuration as an application. If you haven’t seen Raffy’s video on the PCI Application, check it out here.

    pci.png

    We also showed the Splunk for Change Management application. Seeing someone touch a file and watching the Splunk dashboard update instantaneously is an awesome display of how flexible Splunk has become. Check out the developer program for yourself and get your goods up on SplunkBase so we can all check em out.

    changemgmt.png