thebaumblog: Budget

Splunk Live Princeton 2009

Wednesday and we’re at Splunk Live Princeton, NJ. What an awesome place. Princeton is home to a great university and some great culinary experiences. Check out Mediterra, an interesting mix of Italian and Spanish influences. Apparently it’s where all the Princeton parents treat their kids to dinner when they are in town. Next door to our venue was the great hope for the state of NJ: a new Governor. The current Governor has turned the state budget and tax base into toxic waste. Well, things went much better for the more than 60 Splunk Live attendees in Princeton today, who gained insight into how a number of large Splunk customers keep their mission-critical applications running in a time of IT budget slash and burn.

Matthew Stevens, Director of Software Systems and Architecture at Comcast, provides guidance to Comcast executives on mission-critical media systems and strategic systems architecture. Comcast is the country’s largest provider of cable services, serving 23.9 million cable customers, 15.3 million high-speed Internet customers and 7.0 million Comcast Digital Voice customers.

Comcast Developer Network

Matthew’s latest project is the Comcast Developers Network, a Comcast-scale secure web services platform for the development of cool new media and entertainment offerings. The Comcast Web Platform environment generates billions of software events each day from caching and load-balancing systems, origin application servers, databases, middleware and content delivery networks for images and video streams. Comcast services demand high quality. Much of the Comcast content is exclusive, and premium services drive revenue. Interfaces between technology components (applications, delivery platforms) need to adhere to best practices to ensure the best possible end customer experience.

Why Splunk?

Comcast has acquired many system and application management platforms over the years, but none provided the robust operational telemetry that teams around the company need to ensure data integrity, stability, application quality and efficiency. Several efforts specifically drove Comcast to consider and deploy Splunk.

  • Product rollout: The team wanted the ability to predict and correct potential issues before going live into production. Splunk has become a required best practice for new product rollouts.
  • Network/System Integrity: Understanding security and user experience across a very large network and set of systems is a must to protect the business. Splunk provides the insight the network and system teams need across many different silos of technologies.
  • Business Intelligence: Having immediate access to real-time events and historical trends allows the various Comcast business teams to react quickly and adapt to changing customer behaviors.
  • Agility: Alerts and Dashboards indicate discrepancies so distributed teams can investigate immediately and remediate failures and attacks.

Video CDN/CMS Performance

“In content management systems and delivery networks a devil walks the long tail. If you’re facing concurrent hits across the tail of the curve, sharpen your pencil, you’ve got problems!”

Splunk helps Comcast understand the risks of instability in its systems, especially during periods of high concurrency. Through pre-production modeling of event patterns and subsequent monitoring of these patterns, Splunk pays for itself by helping Comcast avoid deployment of vulnerable systems, downtime, and upset customers.

Predicting System Imbalance

Comcast has successfully used Splunk to evaluate potential infrastructure vendors’ solutions and determine whether they will balance loads properly across a large, indeterminate infrastructure. Often the answer is no, as illustrated here in a Splunk report of resource utilization across various services.
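To make the idea of a load imbalance check concrete, here is a minimal Python sketch. It is not Comcast's method and not Splunk's search language. It assumes you have already exported per-server request counts, and the server names, the counts and the 0.25 threshold are all hypothetical. It scores balance with the coefficient of variation (standard deviation divided by the mean), which is one simple choice among many.

```python
from statistics import mean, pstdev


def imbalance_report(requests_per_server, max_cv=0.25):
    """Return (coefficient_of_variation, is_balanced, hottest_server).

    max_cv is an assumed, illustrative threshold; tune it for your own systems.
    """
    counts = list(requests_per_server.values())
    avg = mean(counts)
    # Coefficient of variation: spread of the load relative to the average load.
    cv = pstdev(counts) / avg if avg else 0.0
    hottest = max(requests_per_server, key=requests_per_server.get)
    return cv, cv <= max_cv, hottest


# Hypothetical request counts per origin server during one test window.
sample = {
    "origin-01": 1000,
    "origin-02": 1000,
    "origin-03": 1000,
    "origin-04": 5000,
}

cv, balanced, hottest = imbalance_report(sample)
print(f"cv={cv:.3f} balanced={balanced} hottest={hottest}")
```

In this made-up sample one server takes most of the traffic, so the check reports the pool as unbalanced and names that server as the hot spot.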

Splunk has also been utilized to see whether solutions will be resilient to different traffic patterns, helping the company perform predictive analysis before making critical infrastructure investments.

Load testing is performed during off-peak hours, and the results are analyzed for system failures over time using the telemetry data Splunk can correlate across various logs, messages and events.

When failures are found, the Comcast team uses Splunk reports to dig deeper into the data.
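As a rough illustration of analyzing failures over time, the Python sketch below buckets error events by minute. It is only a sketch under stated assumptions: the log line layout (ISO timestamp, level, message), the `ERROR` marker and the sample lines are hypothetical and are not taken from Comcast's logs or from Splunk.

```python
from collections import Counter
from datetime import datetime


def failures_per_minute(lines):
    """Count ERROR events per minute.

    Assumes each line looks like '2009-01-01T02:15:07 ERROR some message'.
    Lines that do not match that assumed layout are skipped.
    """
    buckets = Counter()
    for line in lines:
        parts = line.split(maxsplit=2)
        if len(parts) < 2 or parts[1] != "ERROR":
            continue
        try:
            ts = datetime.strptime(parts[0], "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            continue  # not a timestamp in the assumed format
        buckets[ts.replace(second=0)] += 1
    return dict(sorted(buckets.items()))


# Hypothetical lines merged from several logs during a load test.
sample_lines = [
    "2009-01-01T02:15:07 INFO cache warm",
    "2009-01-01T02:15:41 ERROR origin timeout",
    "2009-01-01T02:16:02 ERROR origin timeout",
    "2009-01-01T02:16:30 ERROR db connection refused",
    "garbled line without a timestamp",
]

for minute, count in failures_per_minute(sample_lines).items():
    print(minute.isoformat(), count)
```

A spike in one bucket is the cue to drill into the raw events from that minute, which is the kind of follow-up the reports described above are used for.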


Security and Compliance

In addition to operations use cases, Comcast security and compliance teams leverage the consolidated logs across data centers to enable faster threat assessment and security monitoring.

  • Monitoring for bad actors to trigger alerts,
  • Conducting threat detection over time,
  • Detecting attacks/vulnerabilities in systems and
  • Auditing systems in support of security assessments and compliance.
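The first item in the list above, alerting on bad actors, can be sketched in a few lines of Python. This is a hypothetical example, not Comcast's detection logic: the event fields (`src`, `action`), the `login_failed` value, the threshold and the sample addresses (taken from the reserved documentation ranges) are all assumptions made for illustration.

```python
from collections import Counter


def flag_bad_actors(events, threshold=3):
    """Return sources with at least `threshold` failed logins, sorted by name.

    Each event is assumed to be a dict with 'src' and 'action' keys.
    """
    failures = Counter(
        event["src"] for event in events if event.get("action") == "login_failed"
    )
    return sorted(src for src, count in failures.items() if count >= threshold)


# Hypothetical events consolidated from several data centers.
sample_events = [
    {"src": "192.0.2.10", "action": "login_failed"},
    {"src": "192.0.2.10", "action": "login_failed"},
    {"src": "192.0.2.10", "action": "login_failed"},
    {"src": "198.51.100.7", "action": "login_failed"},
    {"src": "198.51.100.7", "action": "login_ok"},
]

for src in flag_bad_actors(sample_events):
    print(f"ALERT: repeated failed logins from {src}")
```

Because the logs are consolidated in one place, a check like this can count across systems rather than per system, which is the advantage the section describes.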

What’s Next?

Next up for Matthew and team is the launch of the Comcast CodeBig Platform, enabling a network of developers to create content for the network. Some of these developers are already using Splunk in their own managed services like Mashery. Comcast is working to hook the Mashery Splunk installation to their own in order to provide visibility across multiple services and providers of content and entertainment functionality.

Chris Abboud manages the Enterprise Systems Management team at Dow Jones, monitoring customer-facing infrastructure and applications. Dow Jones provides global business news and information services to millions of consumers and enterprise media groups. Keeping these revenue-generating services running 7×24×365 is the highest priority. Chris also manages the DJ service management platforms (Remedy, Knowledge Base, etc.). He’s been with the DJ organization for 10 years, and in his current role for 3 years.

“Our mission is to address issues before they become service impacting events. Failures are going to happen — we need to make sure people know about them as soon as possible.”

The Splunk Set-up

The Dow Jones Splunk installation includes

  • Data from 6000+ servers globally,
  • 13,500+ source types,
  • 1,700 network devices (primarily Cisco and Juniper) and
  • Ten distributed Splunk servers in different geographies that index ~100GB a day and provide a new global logging console.

Why Splunk?

Each Dow Jones command center now has the ability to know what’s happening before customers do across a wide range of internal and external services. Splunk speeds the time to resolution for email outages that may impact internal users’ productivity and for editorial site downtime that can directly impact customer service and revenue. Dow Jones has found that Splunk generates significantly fewer false positives than traditional monitoring systems, and that new resources are much easier to manage and deploy.