thebaumblog: Archive for August, 2009

Splunk 4 Down Under

I visited Sydney and Melbourne last week to host our first Splunk Live events in Australia. It’s my first visit to Australia and I’m really blown away by the friendliness of the people we’ve met. And the “Australian for Grep” t-shirt finally had a proper home. Attendees at today’s event in Melbourne and Tuesday’s event in Sydney included an impressive list of current customers and partners and a number of new users evaluating Splunk for the first time, including Telstra, Ericsson, InfoSys, Frontline Systems, Fujitsu, GE Capital Finance, Toll Holdings, Vanguard Investments and more. We owe a huge thanks to the team from Digital Networks Australia who sponsored the two events.

Martin Brown, A Large Australian Financial Services Company

In Sydney Martin Brown, pictured below with me, gave an excellent presentation on using Splunk for Identity Management Compliance. Martin is a Technical Architect managing the development and operations of the world wide web application security system for a major financial institution. His career has taken him from implantable device electronics and software engineering through UNIX and network systems administration to internet systems management and security.

Martin’s company needs to present client security history from its web applications and to search that history for suspect IDs from the past six months. Tivoli Access Manager (TAM) is used for both external and internal identity management and access control. More than 200,000 clients authenticate externally through TAM.

His Splunk deployment is very much out of the box with a range of saved searches and some role partitioning. It consists of a single Splunk server with 1TByte of local disk for retention. The TAM logs are rsynced regularly and directly mounted from various hosts and systems. 12 internal and 12 external TAM hosts generate 5 GB/day of data or ~2TB of data a year.
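As a quick sanity check on those volumes, here is the arithmetic spelled out. It is only a back-of-the-envelope sketch: the 183-day approximation of six months, the even spread across hosts and the decimal units are my assumptions, and it says nothing about how much disk the indexed data actually occupies.

```python
# Figures quoted in the post.
HOSTS = 12 + 12      # internal + external TAM hosts
GB_PER_DAY = 5       # total raw log volume per day

gb_per_year = GB_PER_DAY * 365
gb_six_months = GB_PER_DAY * 183                  # assumption: six months ~ 183 days
mb_per_host_per_day = GB_PER_DAY * 1000 / HOSTS   # assumption: even spread, decimal units

print(f"per year: {gb_per_year} GB (~{gb_per_year / 1000:.1f} TB)")
print(f"six months: {gb_six_months} GB")
print(f"per host per day: ~{mb_per_host_per_day:.0f} MB")
```

So the "~2TB a year" figure is a rounding of roughly 1.8 TB of raw logs, and the six-month lookback window covers a little under 1 TB of raw data.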

The current user base consists of business second level support teams and the TAM support group for third level support. The user base is expected to extend to the Risk Management Group and first level help desk support soon. Their classic use case is:

“Client X’s account has been compromised. What applications has he/she logged in to in the past 6 months?”

The old way required days or weeks of work and support from multiple teams. They often needed to pull in log files from offsite backup tapes, then grep through GBytes of data from several hosts. Fun fun. Now with Splunk, Martin’s team finds answers in minutes and will soon train Tier 1 agents to do the same, eliminating the hassle of Martin’s team fetching data for everyone. Next he plans to add app server, web server and load balancer data; role partitioning to restrict business user access to relevant logs; off-shore implementations to present local application logs; and API consumption for a one-stop-shop helpdesk interface.
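To make the classic use case concrete, here is a minimal Python sketch of the question being asked of the data. It is not Martin’s search and not the real TAM log format: the four-field line layout (timestamp, user, application, result), the sample lines and the `apps_used_by` helper are all hypothetical, chosen only to show the shape of the lookup.

```python
from datetime import datetime, timedelta

def apps_used_by(lines, user_id, now, days=183):
    """Return the sorted applications user_id logged in to within the window.

    Assumes a hypothetical line layout: "<ISO timestamp> <user> <app> <result>".
    """
    cutoff = now - timedelta(days=days)
    apps = set()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip anything that does not match the assumed layout
        stamp, user, app, result = parts
        when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")
        if user == user_id and result == "success" and when >= cutoff:
            apps.add(app)
    return sorted(apps)

# Made-up sample data for illustration only.
sample = [
    "2009-03-02T09:15:00 clientx banking success",
    "2009-06-11T18:40:12 clientx trading success",
    "2009-06-12T07:01:45 clienty banking success",
    "2008-11-30T22:10:03 clientx mortgage success",
    "2009-07-20T13:05:59 clientx banking failure",
]
print(apps_used_by(sample, "clientx", now=datetime(2009, 8, 20)))
# prints ['banking', 'trading']
```

The hard part in the old world was never this logic; it was getting six months of logs from two dozen hosts into one place so a query like this could run at all.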

Nick Clark, Ericsson

Nick Clark is a Technology Manager in the Solution Management & Utilities Consulting, System Integration & Multimedia practice with Ericsson where the focus is on bespoke support and life cycle management services for complex infrastructures. His group focuses on mobile and fixed network infrastructure, telecom services, software, broadband and multimedia solutions for operators, enterprises and the media industry. He presented his Splunk solution which Ericsson implemented at Telstra in the mobile multimedia services area to troubleshoot problems and investigate incidents. The solution was initially implemented to provide coverage of the 2008 Beijing Olympics. Telstra predicted massive interest for mobile streaming yet demand exceeded all expectations. Splunk helped Ericsson and Telstra quickly pinpoint, manage and address problems. Because application failures and limits were discovered before they caused serious downtime, Telstra maintained an uptime above 99.9% during the Olympic Games.

Telstra manages more than 10M users and 50 plus content providers on the Telstra Service Delivery Platform providing multiple mobile portals, content transformation, mobile streaming services and device specific rendering and UI over 2G and 3G networks. The environment consists of 60+ servers (Solaris 9/10, Windows 2003) and many platforms and technologies providing service orchestration, rich media content management, encoding and streaming for terabytes of active content.

Ericsson and Telstra’s challenges before Splunk were numerous, including:

  • no central view of logs and events resulting in difficult to troubleshoot problems,
  • support and operations diverted to log fetching and ad-hoc reporting delaying work on high priority projects,
  • no consistent approach to log handling and storage making it difficult to locate, access and archive logs and
  • poor visibility of service and transaction flows extending outages.

The Ericsson team chose Splunk to help Telstra gain a holistic view of the environment, troubleshoot outages more quickly, provide users with ad-hoc reporting and control access to logs by role. They are currently indexing roughly 20GB per day on a dual processor, dual core Xeon GHz server with 16GB of RAM. 30 support people (tier 1 and up) currently use Splunk to search application, server and network logs and events to troubleshoot problems. The team makes extensive use of Splunk tagging to create alerts that notify them when known problems recur. Perhaps the most valuable thing Ericsson has done with Splunk is track end to end transactions on the Service Delivery Platform. With one view across all services and transactions to track activities the team can finally provide transaction level alerting and reporting.
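For readers wondering what end to end transaction tracking boils down to, here is a rough Python sketch of the idea: collect events from every service, group them by a shared transaction ID, order each group by time, and report where a transaction first failed. The event fields (`ts`, `txn`, `service`, `status`), the service names and both helper functions are hypothetical; Nick did not share the actual searches or field names.

```python
from collections import defaultdict

def trace_transactions(events):
    """Group events by transaction id and order each group by timestamp."""
    traces = defaultdict(list)
    for event in events:
        traces[event["txn"]].append(event)
    for steps in traces.values():
        steps.sort(key=lambda e: e["ts"])
    return dict(traces)

def failed_transactions(traces):
    """Map each failed transaction id to the service where its first error occurred."""
    failures = {}
    for txn, steps in traces.items():
        for step in steps:
            if step["status"] == "error":
                failures[txn] = step["service"]
                break
    return failures

# Made-up events arriving out of order from three hypothetical services.
events = [
    {"ts": 3, "txn": "t1", "service": "streaming", "status": "ok"},
    {"ts": 1, "txn": "t1", "service": "portal", "status": "ok"},
    {"ts": 2, "txn": "t1", "service": "transcoder", "status": "ok"},
    {"ts": 1, "txn": "t2", "service": "portal", "status": "ok"},
    {"ts": 2, "txn": "t2", "service": "transcoder", "status": "error"},
]
traces = trace_transactions(events)
print([e["service"] for e in traces["t1"]])  # ['portal', 'transcoder', 'streaming']
print(failed_transactions(traces))           # {'t2': 'transcoder'}
```

Transaction level alerting then amounts to running something like `failed_transactions` on a schedule and notifying someone when the result is not empty.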

Thank you again to Nick and Martin for presenting so well and Monsour, Martin and Sky with DNA who did a fantastic job and are representing Splunk very well down under.

Splunk 4 Lands in the Southwest

Last week we continued our road show launching Splunk 4 through the Southwestern US in Phoenix, San Diego and Los Angeles. This was our second annual gathering of customers, partners and users and we had more than double the attendees at this year’s Splunk Live events. In the morning we held a three-hour hands-on technical workshop. Attendees had the opportunity to install and configure Splunk 4 on their laptops or remote server and get one-on-one assistance from the Splunk team. Afternoon sessions and dinner focused on customer presentations. We’re very grateful to all the presenters who took time out of their busy days to share with everyone how Splunk is transforming their IT environments. I captured some notes from the week and thought I’d share them with you.

Early Warning

In Phoenix we had a packed house at the Sanctuary conference center on the side of Camelback Mountain. At 109 degrees I decided against hiking up it in the early AM. Dave Bridgeman, Data Security Engineer at Early Warning, kept things cool showing the audience how his company uses Splunk in their security operations center. Early Warning collaborates with major financial services companies to facilitate fraud detection through shared information and knowledge in cross-institution environments. The company has an interesting history having spun out of First Data and is now primarily owned by Bank of America, BB&T, JPMorgan Chase and Wells Fargo.

Dave is a well rounded IT professional who started as a developer then moved into network and security management. He currently leads the data security team for Early Warning. The environment he oversees includes a variety of platforms including AS400s, MP300s, AIX, Solaris, Linux and Windows. He uses a combination of Splunk forwarders and syslog forwarders to collect Java and Cobol application logs and FTP/SFTP networking logs.

The Early Warning Splunk installation is designed to track transactions and users from one bank to the next in cross-institution activities. Transaction ID tracing correlates events across applications and services, and Splunk alerts the team when jobs fail so the operations and development teams can securely troubleshoot issues on the fly. And remote accessibility means no more driving into the office to access locked down servers in the middle of the night. On the security side of things Splunk helps Dave’s team track and monitor known fraudsters and bad user names, allowing them to stay vigilant when monitoring external attacks. They also use Splunk to deliver reports for customers, executive committee members and the Security Advisory Committee (with representatives from the founding banks).
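The watchlist idea is simple enough to sketch in a few lines of Python. This is an illustration of the concept only, not Early Warning’s implementation: the `user` field, the sample user names and the `watchlist_hits` helper are all hypothetical.

```python
def watchlist_hits(events, watchlist):
    """Return the events whose user name appears on the watchlist (case-insensitive)."""
    wanted = {name.lower() for name in watchlist}
    return [event for event in events if event["user"].lower() in wanted]

# Made-up watchlist and events for illustration only.
watchlist = {"mallory", "baduser1"}
events = [
    {"user": "alice", "action": "sftp-login"},
    {"user": "Mallory", "action": "sftp-login"},
    {"user": "bob", "action": "ftp-upload"},
    {"user": "baduser1", "action": "ftp-login"},
]
for hit in watchlist_hits(events, watchlist):
    print(hit["user"], hit["action"])
```

Normalizing case before comparing is a small design choice that matters here, since the same bad user name may show up with different capitalization across platforms.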

Amkor

Henry Grant of Amkor, a $2.1B provider of packaging/assembly and testing services for the semiconductor industry, also presented an overview of how his Corporate Data Center team uses Splunk. Henry oversees operations for the company’s SAP, PLM, Supply Chain, Hyperion and Oracle systems. Amkor has a heterogeneous environment of Sun Solaris, IBM iSeries, Cisco ASA firewalls, packaged and custom web and J2EE applications and TACACS/RADIUS accounting and access control technologies. With manufacturing locations in China, Japan, Korea, Taiwan, Singapore and the Philippines and headquarters in Chandler, AZ, the Amkor team is challenged with log and event data overload. GBs of data a day generated at multiple points make operational troubleshooting and security investigations extremely complex.

SOX Compliance

Proving SOX compliance has traditionally been handled by writing and maintaining scripts to collect and report on errors, access controls and log access activities. It was impossible to segregate duties given the lack of access control to the logs and events themselves. Splunk has taken the place of the awkward script writing and maintenance to collect iSeries, Unix and application events and logs and provide automated scheduled reports. The team is now expanding the Splunk footprint to handle network and Oracle logs as well.

Application and System Monitoring

Like most enterprise IT shops, Amkor has figured out that traditional point monitoring tools aren’t enough as they have a hard time scaling to all the modern day technologies, require intrusive agents and only work for known events but don’t handle anomalies and unknowns. Too many issues end up being reported by end users themselves rather than the monitoring systems. With Splunk Henry’s team detects event anomalies in real time and has dramatically cut their response time by hours per incident.

Tools for the Help Desk

Sometimes it’s the simple things that can cut your response time, escalations and IT budget. The Amkor team noticed a lot of calls and emails regarding VPN set-up and access across the company. With Splunk level 1 help desk agents are now able to resolve most of the VPN issues without creating an escalation. Henry’s team built a VPN dashboard driven by a series of searches and reports that gives entry level help desk personnel the insight they need to troubleshoot problems right away.

Henry’s Splunk Tips

The best part of Henry’s overview was the tips for a successful Splunk implementation. I’ve included the list here in hopes that these may help you as well.

  • Provide training that caters to each group’s need.
  • Utilize the deployment Server.
  • Develop a Common Information Model.
  • Update and change as needed.
  • Use Tagging to Normalize Data.
  • Monitor Scheduled Compliance Reports by using the Audit Logs.
  • Integrate Splunk into your processes where possible.
  • Set up a Test/Dev Environment and a Test/Dev Index.

Intuit Consumer Group

The Intuit team of Jeff Ludwig, Chief Architect, and Larry Raab, Architect of the Consumer Group, joined us to share how they use Splunk in production support operations. Jeff leads the Consumer Group’s Connected Services Development for electronic and print tax and payroll filings for TurboTax, ProSeries, Lacerte and QuickBooks. Larry is a large-scale, highly available application and systems architect responsible for the consumer group applications and infrastructure.

While the original use for Splunk at Intuit was application management, Jeff and Larry covered three additional ways they have applied Splunk including reliable monitoring, improving user experience and large-scale reporting for compliance and business intelligence.

Splunk Live London - Awesome

I’m finally getting my head above water after a tireless run up to and hectic week launching Splunk 4. The highlight of the launch for me was Splunk Live London. IMHO Splunk Live London 2009 was unrivaled as the most outstanding Splunk event yet.

We came up with this idea of getting local customers together as a way to launch Splunk 2 in June 2007. Five of us Splunkers sprinted between eight different cities in two weeks to share what was new and encourage users to exchange stories of how searching their data centers was changing life for the better. It’s an exhausting way to launch a new product, but it worked so well we’ve integrated Splunk Live events into the mainstream way we do business and interact with our community. I’ve long since lost count of the number of Splunk Lives we’ve conducted all over the world including places like Cape Town, Johannesburg, Beijing, Tokyo, Singapore, Bangkok, Sao Paulo and yes once again in London.

This year’s London Splunk Live was really special. The event occurred during our launch of Splunk 4 and surpassed our expectations as the largest event we’ve ever held. More than 100 customers and users attended at the Cumberland Hotel and their swank conference facility, complete with a business canteen like breakfast experience, near Marble Arch in West London.

But the dominant reason to attend any Splunk Live is the presentations and round tables with forward thinking IT professionals who are using Splunk to transform the way they manage IT. This year we were very fortunate to have three Splunk customers who took time out of their busy schedules to come to London and share their experiences with us.

Accenture - Alexander Strobl, Technical Consultant

Alexander has been a visionary inside Accenture, bringing the power of IT Search to enterprise clients in Germany where he works for Accenture as a Technical Consultant in the Data Center Technology and Operations team. Alexander is responsible for the analysis, design and roll out of Splunk. His most recent Splunk project was with a large worldwide services company with more than 50,000 employees on three continents operating mail order, distribution, e-commerce and over-the-counter retail trade. Accenture implemented Splunk to transform the management of several technologies including Linux, virtualization and large-scale storage systems.

The project was part of an IT project to reduce the time to triage problems and improve quality of service. Challenges were:

  • no centralized access to logs and events,
  • critical IT data was stored on local file systems which were copied to central storage only once a day,
  • manual processes to locate errors,
  • no correlation between events on different services/servers and
  • development time was spent building workarounds rather than working on revenue generating applications.

All of this resulted in complex and time consuming analysis and, in the end, long MTTR.

The Accenture Splunk installation is currently indexing ~50GB/day including custom application files and events from 10+ integrated business critical applications and services. There are two Splunk indexes; one for testing and one for production environments and the team has established interfaces between Splunk and several other legacy data center tools.

Telenor - Henrik Strøm, Security Architect

Telenor is Norway’s largest ISP, Mobile Operator and Telco. It’s one of the largest mobile operators in the world, with 160+ million customers, and was founded in 1855 - 154 years ago. The company has 13,000 employees in Norway and 26,000 abroad. Telenor has been rolling Splunk out for centralized log collection and management, using Syslog to forward data where it is already in place and using Splunk as a forwarder for new systems and for systems with complex multi-line and/or XML structures Syslog can’t handle. Sources of data handled by Splunk include:

  • application logs (Web, Email, IPTV)
  • data center logs (server, network, storage and firewall)
  • IP backbone logs

Use cases include what Henrik refers to as digging, dashboards, baselines, alerting and reporting. One of the best “digging” examples Henrik mentioned was identifying Unix Kernel Errors over the last 30 days. This kind of information routinely went unnoticed prior to Splunk’s arrival.

Another powerful use case explained by Henrik was how to baseline what is normal in your environment. For example, how many errors do you have on average for a particular type of device (routers, servers, specific applications, etc.)? Splunk was used to baseline normal Linux kernel behavior and found roughly 20 kernel errors per running Linux instance every 15 minutes.

The baseline then allows the team to schedule simple searches that look for deviation from the baseline and send out alerts before these hidden shifts in behavior cause downtime. In one case Splunk found thousands of errors occurring on a specific type of device, where the normal baseline was around 20!
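Here is a minimal Python sketch of the baseline-and-deviation idea, under stated assumptions. Only the "roughly 20 errors per 15 minutes" figure and the "thousands of errors" outlier come from Henrik’s talk. The history values, host names, the `deviations` helper and the 5x alert threshold are all hypothetical; I don’t know what threshold or method Telenor actually uses.

```python
from statistics import mean

def deviations(current, base, factor=5.0):
    """Return the devices whose error count exceeds factor times the baseline."""
    return {device: count for device, count in current.items() if count > base * factor}

# Made-up kernel error counts per 15-minute window, averaging about 20.
history = [18, 22, 19, 21, 20]
base = mean(history)

# Made-up counts for the latest window; one device is far off the baseline.
current = {"linux-01": 23, "linux-02": 17, "linux-03": 4200}

print(f"baseline: {base} errors per 15 minutes")
print(deviations(current, base))  # {'linux-03': 4200}
```

A fixed multiple of the mean is the crudest possible test. It is enough to catch the thousands-versus-20 case described above, though a real deployment would likely want something that accounts for normal variance.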

The Telenor team also uses Splunk to identify and report on security situations that may impact their customer facing network and services. Because they can easily compose dashboards showing, for example, which Web servers are under attack and who is attacking them all in one place, the team saves Telenor from potential downtime, performance degradation or theft of data due to attacks they have not seen before and that are missed by existing security policies and technologies.

Vodafone - Paulo de Carvalho, Network Services Manager

Paulo de Carvalho has been using Splunk at Vodafone for almost two years now. His presentation, titled “Freeing Information from Organizational Silos”, lifted the idea of leveraging logs and IT data out of the realm of just system administration into a thirst for higher level intelligence that crosses not only IT but also business functions. Paulo started by describing the current service oriented architecture (SOA) at Vodafone and how attempts to objectize and re-use capabilities create incredible complexity among the services, technologies, processes, tools and people.