brian: Homepage

Splunking pitchfork album reviews

One of my favorite sites is the record review and music news site Pitchfork Media. The site publishes a bunch of interesting lists, like the top records for each decade or year, but these are obviously more subjective than lists built by crunching the raw ratings. For example, their #1 album of the nineties is Radiohead’s “OK Computer” (rated 10.0) and #15 is “The Bends”, also by Radiohead (which isn’t reviewed on the site at all). I was interested in crunching the data provided by their wealth of reviews, so I downloaded all the record reviews using a simple Python script and parsed out the description, rating, label, reviewer, release year, title and artist using the following regex :

.*?<h2 class="fn">\s*(.*?):<br />([^\n]*)\n.*?<div class="info">\n\[([^<;]*);?\s*(\d*)\]?.*?<span class="rating">(.*?)<.*?<div class="content description">(.*?)</div>.*? - <span class="reviewer"><span class="vcard"><span class="fn">(.*?)</span>.*?title="\d+">(.*?)<
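
As a rough sketch of how a pattern like this can be applied from Python: the pattern below is a cut-down version of the regex above that only pulls out the artist, title and rating. The function name parse_review and the sample HTML are invented for illustration and are not copied from the site. The re.DOTALL flag is what lets `.*?` run across line breaks.

```python
import re

# Cut-down, hypothetical variant of the full regex: artist, title and rating only.
REVIEW_PATTERN = re.compile(
    r'<h2 class="fn">\s*(?P<artist>.*?):<br />(?P<title>[^\n]*)\n'
    r'.*?<span class="rating">(?P<rating>.*?)<',
    re.DOTALL,  # let "." match newlines so ".*?" can skip over whole lines
)

# Invented sample markup, shaped to match the pattern above.
SAMPLE_HTML = '''<h2 class="fn"> Radiohead:<br />OK Computer
</h2>
<div class="info">
[Capitol; 1997]
</div>
<span class="rating">10.0</span>
'''


def parse_review(html):
    """Return a dict of extracted fields, or None if the page does not match."""
    match = REVIEW_PATTERN.search(html)
    if match is None:
        return None
    return {
        "artist": match.group("artist"),
        "title": match.group("title"),
        "rating": float(match.group("rating")),
    }


if __name__ == "__main__":
    print(parse_review(SAMPLE_HTML))
```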

I can now run some interesting queries :

  • * | chart avg(rating) by releaseYear

    This graphs the average rating by release year.
  • * | stats count(title), avg(rating) by artist | search "count(title)">2 | sort "avg(rating)" d | head 10

    This shows the top-rated artists that have at least 3 reviews on Pitchfork.
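
For comparison, here is roughly what the second query computes, sketched in plain Python over a list of already-parsed reviews. The function name top_rated_artists and the dict layout are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict


def top_rated_artists(reviews, min_reviews=3, limit=10):
    """Group ratings by artist, keep artists with enough reviews,
    and return (artist, review_count, average_rating) rows, best first."""
    ratings_by_artist = defaultdict(list)
    for review in reviews:
        ratings_by_artist[review["artist"]].append(review["rating"])

    rows = [
        (artist, len(ratings), sum(ratings) / len(ratings))
        for artist, ratings in ratings_by_artist.items()
        if len(ratings) >= min_reviews  # same cut-off as count(title) > 2
    ]
    rows.sort(key=lambda row: row[2], reverse=True)  # highest average first
    return rows[:limit]  # equivalent of "head 10" by default
```

Calling it with a list such as [{"artist": "A", "rating": 8.0}, ...] returns one row per qualifying artist, sorted by average rating.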

Auto host resolving in splunk using python

This only works in 2.0.x
OK, so I’ve had a couple of people ask me how to resolve the IP addresses in their syslog files to hostnames in splunk.
There’s no way to do this just by tweaking a config variable, so we need to dig a little deeper under the surface. It’s actually pretty easy to get splunk to call out to python during event processing, so I’ve used that functionality to solve this problem.

Note that this will negatively impact indexing performance but it should work until we get this behavior baked into splunk.

First up, I’ve created a python script that calls socket.gethostbyaddr to resolve the hosts. It also caches the results so that the performance hit from repeated DNS lookups is reduced.
So copy and paste the following into your favorite editor and save it to <SPLUNK_HOME>/lib/python2.4/site-packages/splunk/pyHostNameResolve.py . This directory is where the dynamically loaded python will look for scripts; the filename will be referenced later in a config change.
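
The script itself is not shown here. Purely as an illustration of the caching idea, and not the actual pyHostNameResolve.py or the interface splunk expects, a lookup cache around socket.gethostbyaddr might look something like the sketch below. The class name CachingResolver and the injectable lookup argument are hypothetical, and the code targets a current Python 3 rather than the Python 2.4 bundled with splunk.

```python
import socket


class CachingResolver:
    """Resolve IP addresses to hostnames, remembering both hits and misses."""

    def __init__(self, lookup=socket.gethostbyaddr):
        self._lookup = lookup  # injectable so the cache can be exercised offline
        self._cache = {}       # ip -> hostname, or the ip itself if lookup failed

    def resolve(self, ip):
        if ip in self._cache:
            return self._cache[ip]
        try:
            # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
            hostname = self._lookup(ip)[0]
        except OSError:
            # socket.herror and socket.gaierror are OSError subclasses;
            # fall back to the raw address and cache that too, so a failed
            # lookup is not retried for every event from the same host
            hostname = ip
        self._cache[ip] = hostname
        return hostname
```

Caching failures as well as successes matters here, since an address with no reverse DNS entry would otherwise trigger a slow lookup on every event.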

Splunk Cheat Sheet !

I’ve been pretty busy so I haven’t updated for a while but I thought I should share this :
Corey Shields has made a great splunk cheat sheet ! It’s available at : http://staff.osuosl.org/~cshields/?p=140
It’s pretty awesome, and I’m recommending that everyone I know that uses splunk downloads it.
Until next time,
Brian

Splunking from Python Part I

One of the neat things about splunk is that its search interface is a SOAP call. In this post I’m going to talk about using the python modules that ship with splunk to talk to splunk over this SOAP interface.
First off you will need to set some environment variables so that you are running the version of python that ships with splunk :


export SPLUNK_HOME=<WHERE_YOU_INSTALLED_SPLUNK>
export PATH=$SPLUNK_HOME/bin:$PATH
export LD_LIBRARY_PATH=$SPLUNK_HOME/lib:$LD_LIBRARY_PATH

OK, now you should be good to go, so fire up python. Your python version should be 2.4.2. If it’s not, run “which python” from the command prompt to make sure you are using the python that shipped with splunk.
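
If you would rather check from inside the interpreter, something like the following sketch compares the running interpreter’s path with $SPLUNK_HOME/bin. The function name running_splunk_python is hypothetical, and it assumes the bundled interpreter lives directly in $SPLUNK_HOME/bin, as the PATH setting above implies.

```python
import os
import sys


def running_splunk_python(splunk_home=None, executable=None):
    """Return True if the interpreter path sits in <splunk_home>/bin."""
    if splunk_home is None:
        splunk_home = os.environ.get("SPLUNK_HOME", "")
    if executable is None:
        executable = sys.executable
    if not splunk_home:
        return False  # SPLUNK_HOME is not set, so there is nothing to compare
    expected_bin = os.path.join(os.path.abspath(splunk_home), "bin")
    return os.path.dirname(os.path.abspath(executable)) == expected_bin
```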
We need to do some setup before any searches can be run :


Python 2.4.2 (#1, Mar 11 2009, 21:45:07)
[GCC 4.0.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.


>>> import splunk.search.splunkTest #initialize the python internals without using twistd
>>> import splunk.search.SearchCore as SearchCore #This is the module we are going to use to issue searches

If you want to run against a remote splunk server or on different ports you can run the following :


>>> SearchCore.SearchService.gSearchService._searchEngineURL = "http://<remote_host>:<searchengine_port>"

Slow queries and solutions.

Since the launch of the 1.2 product some people are experiencing really slow query times. This is especially noticeable when you have a live splunk that runs pretty often, as this tends to fragment the database quite a bit.

Fear not, as there is a hidden undocumented call that you can make ! If you run the query “++cmd++::optimize” you will trigger a database optimization. This call may take a while to return, so use it with care. Soon we will have a release with an auto-optimizer, but if slow queries are hampering your splunking right now you can create a live splunk that runs “++cmd++::optimize” every 10-30 mins.

Laters,

Brian

First Post

First Post !

So this is the start of my splunk blog.

First up I’m splunk employee #1. Way back in Sept. 2004 I joined Erik, Rob and Michael when they were still based down in the VC offices in Palo Alto. I’m responsible for searches and indexing so if you have splunks that are taking WAAAY too long to complete I’m the person that’s probably responsible.

I’ll post more later on what I’m coding, struggling against or just hacking on.

Brian out.