The SSL Performance Odyssey

When you come to dev.splunk.com, you see pictures of beer pong, full bars, stuffed ponies with fart machines taped to their asses, etc. - basically engineers gone wild. Somewhere amid all of this insanity, we actually find the time to write code and solve problems like this one. This post is all about a crazy-weird performance issue that we were experiencing, how it manifested itself and ultimately how it was fixed.

I suspect others may be having this problem since, as far as I can tell, it lives in some very popular open source code. With that, I’ll begin telling you about my journey into hell.

Splunk has a homegrown embedded HTTP(S) server that serves up all external interfaces to the ‘splunkd’ daemon. We use it as the core engine for our REST and XML/RPC-like APIs. The GUI and the CLI both end up talking to the daemon via this server.

When I wrote the core of it a few months ago, I ran some rudimentary performance tests on several platforms and it seemed decent enough for our use, but a week ago, the manager of the Search and Indexing team (Stephen) said that he was seeing abysmal performance using SSL. He said that the GUI performance was being impacted. I didn’t believe him and insisted that it was something else and that he was high.

Diagraming Splunk’s data-flow (part 2 - performance overlays)

In my previous post “Diagraming Splunk’s data-flow” I wrote a small Python script that parsed Splunk’s runtime environment ($SPLUNK_HOME/var/run/splunk/composite.xml) and generated a file that, when fed into graphviz, produces a nice architectural diagram of how pipelines and processors are wired together.
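To give a feel for the kind of file graphviz expects, here is a minimal sketch of emitting DOT text for pipelines and processors. It is not the script from the post: the layout of composite.xml is not shown here, so the sketch skips XML parsing and takes a plain in-memory mapping instead. The function name `to_dot` and the pipeline and processor names in the usage note are all hypothetical.

```python
def to_dot(pipelines, queues=()):
    """Build graphviz DOT text from a hypothetical in-memory description.

    pipelines: dict mapping a pipeline name to its ordered processor names.
    queues: iterable of (source_node, destination_node) pairs linking pipelines.
    """
    lines = ["digraph splunk {"]
    for name, processors in pipelines.items():
        # One cluster per pipeline so its processors are drawn inside a box.
        lines.append(f'  subgraph "cluster_{name}" {{')
        lines.append(f'    label="{name}";')
        for proc in processors:
            lines.append(f'    "{name}.{proc}";')
        lines.append("  }")
        # Processors run in order, so connect each one to the next.
        for src, dst in zip(processors, processors[1:]):
            lines.append(f'  "{name}.{src}" -> "{name}.{dst}";')
    # Dashed edges stand for queues between pipelines.
    for src, dst in queues:
        lines.append(f'  "{src}" -> "{dst}" [style=dashed];')
    lines.append("}")
    return "\n".join(lines)
```

Calling `to_dot({"parsing": ["readerin", "sendout"]})` with those made-up names returns text that can be saved to a .dot file and opened in graphviz.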

In this installment, I took it to the next level by using Splunk’s search capability to overlay performance metrics on the diagram. The combination of Splunk logging metrics information for each processor within each pipeline (thanks Brad) and the ability to have Splunk execute a search processor written in Python made this possible. Here is how you use it:

First, download graphviz. I particularly like the OS X application that they’ve written because you can see the graph on the screen and, as the file changes, those changes are reflected in the graph you are viewing. If you don’t have a Mac, use the command line version to generate different types of output file formats like .jpeg, etc.

Go to SplunkBase to download my Python script. Copy the .py file into $SPLUNK_HOME/etc/searchscripts.

Start Splunk.

Type the following into the search box:
[Screenshot: index___internal metrics pipeline processor NOT get - over all time - localhost - Splunk 3.2-UNSTABLE-4.jpg]
This will search for the appropriate metrics information and pipe the results through the script.

Diagraming Splunk’s data-flow

This blog entry is not about how the framework works. It is about a semi-cool visualization that I created using Python and graphviz. If you watched the video where I presented Splunk’s framework architecture from a high level, you know what pipelines and processors are. If you haven’t, here is a very quick overview.

  • A pipeline is a thread of execution that lives within the splunkd process. Each pipeline executes a series of processors, each of which operates on data. The data is created when the first processor on the pipeline reads it from some input (like tailing a file, or receiving it on a network port). Each processor then does something to the data. Eventually, the data gets indexed and execution is returned to the first processor to get more data again.
  • Pipelines are connected via queues. A queue output processor (the last processor in a pipeline) puts data onto a queue and blocks if the queue is full. A queue input processor (the first processor at the top of a pipeline) gets the data item from the bottom of the queue and sends it on down the pipeline. If there is no data, it blocks waiting for some to be put on the queue.
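The overview above can be sketched in a few lines of Python. This is only an illustration of the pattern (threads running processors in sequence, joined by a bounded blocking queue), not Splunk’s actual C++ framework. The names `run_pipeline`, `queue_input`, `demo` and the `SENTINEL` end-of-stream marker are all made up for the example.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker, not part of Splunk


def run_pipeline(source, processors, sink):
    """Pull each item from source, run it through every processor, hand it to sink."""
    for item in source:
        for process in processors:
            item = process(item)
        sink(item)


def queue_input(q):
    """First processor of a downstream pipeline: blocks while the queue is empty."""
    while True:
        item = q.get()  # blocks until an item is available
        if item is SENTINEL:
            return
        yield item


def demo():
    q = queue.Queue(maxsize=2)  # bounded, so put() blocks when the queue is full
    indexed = []  # stands in for the index at the end of the last pipeline

    def upstream():
        # q.put plays the role of the queue output processor.
        run_pipeline(["a b", "c d"], [str.upper], q.put)
        q.put(SENTINEL)

    def downstream():
        run_pipeline(queue_input(q), [str.split], indexed.append)

    threads = [threading.Thread(target=upstream), threading.Thread(target=downstream)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return indexed
```

Here `demo()` runs two pipelines on two threads: the first uppercases each string and puts it on the queue, the second takes items off the queue, splits them into words, and appends them to a list.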

The framework team is hiring

Splunk’s framework team is involved in many diverse projects. The “framework” itself is really a set of generic code that makes up the runtime environment of Splunk. In addition, we also handle bringing data into the system, distributing this data across enterprise topologies, authentication, access controls, configuration management, distributed deployment, high availability, real-time streaming, encryption and much, much more.

Splunk is extending its reach into extremely large deployments involving thousands of machines and devices across multiple data centers. The framework team is responsible for making Splunk excel in these challenging environments. If this sounds interesting and you want to work with some extremely talented people, please drop me an email.

Framework Architect / Senior Engineer

We are looking for a highly motivated engineer who will be responsible for driving the design and implementation of Splunk’s network management, scalability, and distributed deployment technology. The right candidate is fluent in C++, high-performance networking and concurrent / multi-threaded design.

Qualifications

  • Minimum 5 years of relevant industry experience
  • Expert C++ knowledge, deep understanding of design patterns and experience building clean external APIs
  • Significant experience with multi-threaded design and implementation
  • Has designed & implemented high-throughput server systems

Software configuration - why does this wheel need re-invention?

I have worked on so many software projects that I can’t possibly enumerate them. Most of my contribution to these projects has been on the server side of things. Every one of these projects needed to be configured in some way, shape or form, and I just realized that every one of them had its own configuration subsystem that was implemented from scratch. Many of these configurations could be managed via GUIs and/or CLIs, and others were simply “managed” via vi or emacs. They all share one thing in common, however - they all suck in one way or another. Why? Because configuration subsystems are incredibly difficult to get right.

On the surface, building a configuration system seems boring. If I went and showed the sales guys how cool my configuration system was, they would roll their eyes back into their heads. Put some rotating, flashing thing on the GUI and they think you’re the coolest, most creative developer around. The fact is that a good configuration system makes a huge difference to a product. In fact, it can make or break it in some cases.