rob: hacks

Diagraming Splunk’s data-flow (part 2 - performance overlays)

In my previous post, “Diagraming Splunk’s data-flow”, I wrote a small Python script that parses Splunk’s runtime environment ($SPLUNK_HOME/var/run/splunk/composite.xml) and generates a file that, when fed into Graphviz, produces an architectural diagram of how pipelines and processors are wired together.
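To make the idea concrete, here is a minimal sketch of the Graphviz half of that process. It is not the script from the previous post: it skips parsing composite.xml entirely (the layout of that file is not shown here) and starts from a plain dictionary. The function name and the pipeline and processor names are made up for illustration.

```python
def pipelines_to_dot(pipelines):
    """Return Graphviz DOT text for a mapping of pipeline name -> ordered processor names."""
    lines = ["digraph splunk {", "  rankdir=TB;"]
    for index, (pipeline, processors) in enumerate(pipelines.items()):
        # Graphviz draws a box around subgraphs whose name starts with "cluster".
        lines.append(f"  subgraph cluster_{index} {{")
        lines.append(f'    label="{pipeline}";')
        # Prefix node ids with the pipeline name so processors that share a
        # name in different pipelines stay distinct.
        node_ids = [f'"{pipeline}/{p}"' for p in processors]
        for node_id, processor in zip(node_ids, processors):
            lines.append(f'    {node_id} [label="{processor}"];')
        # Chain the processors in execution order.
        for src, dst in zip(node_ids, node_ids[1:]):
            lines.append(f"    {src} -> {dst};")
        lines.append("  }")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical pipeline and processor names, for illustration only.
example = {
    "parsing": ["queue_in", "line_breaker", "queue_out"],
    "indexing": ["queue_in", "indexer"],
}
dot_text = pipelines_to_dot(example)
print(dot_text)
```

Saving the printed text to a file such as splunk.dot and running `dot -Tpng splunk.dot -o splunk.png` should render one box per pipeline with its processors chained inside.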

In this installment, I took it to the next level by using Splunk’s search capability to overlay performance metrics on the diagram. Two things made this possible: Splunk logs metrics information for each processor within each pipeline (thanks Brad), and Splunk can execute a search processor written in Python. Here is how you use it:

First, download Graphviz. I particularly like the OS X application because it shows the graph on screen and redraws it whenever the file changes. If you don’t have a Mac, use the command-line version to generate output in formats such as .jpeg.

Go to SplunkBase and download my Python script. Copy the .py file into $SPLUNK_HOME/etc/searchscripts.

Start Splunk.

Type the following into the search box:
[Screenshot: index___internal metrics pipeline processor NOT get - over all time - localhost - Splunk 3.2-UNSTABLE-4.jpg]
This will search for the appropriate metrics information and pipe the results through the script.
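As a rough illustration of what an overlay can look like on the Graphviz side, the sketch below shades each processor node by its share of CPU time. It is not the script from SplunkBase, and it does not read Splunk search results: the function name, the processor names, and the CPU figures are all hypothetical.

```python
def overlay_attributes(cpu_seconds):
    """Map each processor to DOT node attributes shaded by its share of CPU time."""
    busiest = max(cpu_seconds.values(), default=0.0)
    attributes = {}
    for processor, seconds in cpu_seconds.items():
        share = seconds / busiest if busiest else 0.0
        # Graphviz accepts colors as "H S V" triples in the range 0 to 1.
        # Hue 0 is red, so saturation runs from white (idle) to red (busiest).
        attributes[processor] = (
            f'[label="{processor}\\n{seconds:.2f}s", '
            f'style=filled, fillcolor="0.000 {share:.3f} 1.000"]'
        )
    return attributes


# Hypothetical processors and CPU totals, for illustration only.
sample = {"line_breaker": 1.5, "indexer": 3.0}
attrs = overlay_attributes(sample)
for processor, attribute_text in attrs.items():
    print(f'"{processor}" {attribute_text};')
```

Each printed line is a DOT node statement, so the busiest processor shows up as the reddest node once the file is rendered.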

Diagraming Splunk’s data-flow

This blog entry is not about how the framework works. It is about a semi-cool visualization that I created using Python and Graphviz. If you watched the video where I presented Splunk’s framework architecture at a high level, you know what pipelines and processors are. If you haven’t, here is a very quick overview.

  • A pipeline is a thread of execution that lives within the splunkd process. Each pipeline executes a series of processors, each of which operates on data. The data is created when the first processor on the pipeline reads it from some input (such as tailing a file or receiving it on a network port). Each processor then does something to the data. Eventually, the data gets indexed and execution returns to the first processor to get more data.
  • Pipelines are connected via queues. A queue output processor (the last processor in a pipeline) puts data onto a queue and blocks if the queue is full. A queue input processor (the first processor in a pipeline) gets the data item from the bottom of the queue and sends it on down the pipeline. If there is no data, it blocks, waiting for some to be put on the queue.
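The two bullets above can be sketched with two Python threads joined by a bounded queue. This is only an analogy for the behavior described, not splunkd’s implementation. The function names, the sentinel value, and the stand-in “processors” are all invented for the example.

```python
import queue
import threading

SENTINEL = None  # marks end of input in this sketch; not a Splunk concept


def input_pipeline(lines, out_queue):
    """First pipeline: read data, process it, and hand it to the queue."""
    for line in lines:
        event = line.strip()   # stand-in for a processor doing some work
        out_queue.put(event)   # blocks while the queue is full
    out_queue.put(SENTINEL)


def index_pipeline(in_queue, indexed):
    """Second pipeline: pull events from the queue and 'index' them."""
    while True:
        event = in_queue.get()  # blocks while the queue is empty
        if event is SENTINEL:
            break
        indexed.append(event.upper())  # stand-in for the indexing processor


shared_queue = queue.Queue(maxsize=2)  # a small bound makes the blocking matter
indexed = []
source = ["first event\n", "second event\n", "third event\n"]

producer = threading.Thread(target=input_pipeline, args=(source, shared_queue))
consumer = threading.Thread(target=index_pipeline, args=(shared_queue, indexed))
producer.start()
consumer.start()
producer.join()
consumer.join()

print(indexed)
```

Because the queue is first-in, first-out and each side blocks rather than dropping data, the events come out in the order they were read, however the two threads are scheduled.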