Distributed Search

In this post I will be talking about a feature of Splunk that got turbocharged for 4.0: Distributed Search.

Splunk is a great tool when it’s just running on a single system, but distributed search has some great advantages:

  • Provides completely different views into the same data by running different apps on different systems.
  • Lets you leverage a map-reduce architecture to run complex queries.
  • Scales Splunk indexing linearly by simply adding more servers.
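
To make the map-reduce point a bit more concrete, here is a toy Python sketch. It is not how Splunk implements distributed search internally; the function names (`map_on_peer`, `reduce_on_head`) and the event dictionaries are made up for illustration. The idea is only that each search peer computes a partial result over its own data, and the search head merges those partials.

```python
from collections import Counter
from functools import reduce


def map_on_peer(events):
    """Hypothetical 'map' step: one peer counts its own events per host."""
    return Counter(event["host"] for event in events)


def reduce_on_head(partials):
    """Hypothetical 'reduce' step: the search head merges partial counts."""
    return reduce(lambda left, right: left + right, partials, Counter())


# Made-up events held by two different peers.
peer_a = [{"host": "web1"}, {"host": "web2"}, {"host": "web1"}]
peer_b = [{"host": "web1"}, {"host": "db1"}]

# Each peer only touches its own data; only the small summaries are merged.
totals = reduce_on_head(map_on_peer(events) for events in (peer_a, peer_b))
```

Because only the per-peer summaries travel to the search head, adding peers spreads the counting work rather than funnelling every raw event through one machine.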

Terminology Used:

  • Search Head: The Splunk instance that the user logs into and distributes searches from.
  • Search Peer: A Splunk instance that receives search requests from the search head.
  • $SPLUNK_HOME: The root of your Splunk install; this environment variable will be automatically …

Splunking Pitchfork album reviews

One of my favorite sites is the record review and music news site Pitchfork Media. The site has a bunch of interesting statistics, like the top records for each decade or year, but these lists are obviously more subjective than they would be if they came from crunching the raw ratings. For example, their #1 album of the nineties is Radiohead’s “OK Computer” (rated 10.0), and #15 is “The Bends”, also by Radiohead (which isn’t reviewed on the site at all). I was interested in crunching the data provided by their wealth of reviews, so I downloaded all the record reviews using a simple Python script and parsed out the description, rating, label, reviewer, release year, title and artist using the following regex …

Auto host resolving in Splunk using Python

This only works in 2.0.x.
OK, so I’ve had a couple of people ask me how to resolve the IP addresses in their syslog files to their hostnames in Splunk.
There’s no way to do this just by tweaking a config variable; we need to dig a little deeper under the surface. It’s actually pretty easy to get Splunk to call out to Python during event processing, so I’ve used that functionality to solve this problem.

Note that this will negatively impact indexing performance, but it should work until we get this behavior baked into Splunk.

First up, I’ve created a Python script that calls socket.gethostbyaddr to resolve the hosts. It will also cache the results so that the …

Splunk Cheat Sheet!

I’ve been pretty busy so I haven’t updated for a while, but I thought I should share this:
Corey Shields has made a great Splunk cheat sheet! It’s available at: http://staff.osuosl.org/~cshields/?p=140
It’s pretty awesome, and I’m recommending that everyone I know who uses Splunk download it.
