brian: dev

40 Days Of 4.0: Distributed Search

In this post I will be talking about a feature of Splunk that got turbocharged for 4.0: Distributed Search.

Splunk is a great tool when it’s running on a single system, but distributed search adds some great advantages:

  • Provide completely different views into the same data by running different apps on different systems.
  • Leverage a map-reduce architecture to run complex queries.
  • Scale Splunk indexing linearly by simply adding more servers.
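
To make the map-reduce point a little more concrete, here is a minimal Python sketch of how an average could be computed across several peers. This is not Splunk's actual implementation; the function names (peer_map, head_reduce) and the event fields (group, value) are made up for illustration. The idea is that each peer returns a partial (sum, count) per group, and the search head merges those partials before dividing.

```python
from collections import defaultdict


def peer_map(events):
    """Hypothetical 'map' step on one search peer: partial (sum, count) per group."""
    partial = defaultdict(lambda: [0.0, 0])
    for event in events:
        acc = partial[event["group"]]
        acc[0] += event["value"]
        acc[1] += 1
    return {group: tuple(acc) for group, acc in partial.items()}


def head_reduce(partials):
    """Hypothetical 'reduce' step on the search head: merge partials, then average."""
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for group, (total, count) in partial.items():
            merged[group][0] += total
            merged[group][1] += count
    return {group: total / count for group, (total, count) in merged.items()}


# Invented events, split across two peers.
peer_a = [{"group": "a", "value": 10.0}, {"group": "b", "value": 8.0}]
peer_b = [{"group": "a", "value": 6.0}]

averages = head_reduce([peer_map(peer_a), peer_map(peer_b)])
```

Sending (sum, count) pairs rather than per-peer averages matters: averaging the averages would give the wrong answer whenever peers hold different numbers of events for a group.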

Terminology Used:

  • Search Head: The Splunk instance that the user logs into and that distributes searches.
  • Search Peer: A Splunk instance that receives search requests from the search head.
  • $SPLUNK_HOME: The root of your Splunk install. This environment variable is set automatically if you source $SPLUNK_HOME/bin/setSplunkEnv on Unix.

Note that this post is written with *nix in mind, but it applies to Splunk on Windows as well.
For a basic primer and a nice diagram, you can check out http://www.splunk.com/base/Documentation/latest/Admin/Whatisdistributedsearch

Splunking Pitchfork album reviews

One of my favorite sites is the record review and music news site Pitchfork Media. The site has a bunch of interesting statistics, like the top records for each decade or year, but these lists are obviously more subjective than they would be if the raw stats had been crunched. For example, their #1 album of the nineties is Radiohead’s “OK Computer” (rated 10.0) and #15 is “The Bends”, also by Radiohead (which isn’t reviewed on the site at all). I was interested in crunching the data provided by their wealth of reviews, so I downloaded all the record reviews using a simple Python script and parsed out the description, rating, label, reviewer, release year, title and artist using the following regex:

.*?<h2 class="fn">\s*(.*?):<br />([^\n]*)\n.*?<div class="info">\n\[([^<;]*);?\s*(\d*)\]?.*?<span class="rating">(.*?)<.*?<div class="content description">(.*?)</div>.*? - <span class="reviewer"><span class="vcard"><span class="fn">(.*?)</span>.*?title="\d+">(.*?)<
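
As a rough sketch of how this pattern behaves, here it is exercised in Python against a tiny, invented HTML snippet. Several things here are assumptions: the attribute quotes are written as plain double quotes, re.DOTALL is turned on so that "." can cross line breaks, the mapping of capture groups to field names is inferred from their position in the markup, and the eighth group (which the field list above does not name) is simply called "trailing". The sample markup is a placeholder, not a real Pitchfork page.

```python
import re

# The pattern from the post, split across lines for readability.
# Assumption: DOTALL, so that ".*?" can span the newlines in the page.
REVIEW_RE = re.compile(
    r'.*?<h2 class="fn">\s*(.*?):<br />([^\n]*)\n'
    r'.*?<div class="info">\n\[([^<;]*);?\s*(\d*)\]?'
    r'.*?<span class="rating">(.*?)<'
    r'.*?<div class="content description">(.*?)</div>'
    r'.*? - <span class="reviewer"><span class="vcard"><span class="fn">(.*?)</span>'
    r'.*?title="\d+">(.*?)<',
    re.DOTALL,
)

# Inferred field order (an assumption); "trailing" is a hypothetical name.
FIELD_NAMES = ("artist", "title", "label", "releaseYear",
               "rating", "description", "reviewer", "trailing")

# Invented placeholder markup shaped to fit the pattern.
SAMPLE = (
    '<h2 class="fn"> Some Artist:<br />Some Album\n'
    '</h2>\n'
    '<div class="info">\n'
    '[Some Label; 1999]</div>\n'
    '<span class="rating">7.5</span>\n'
    '<div class="content description">Sample review text.</div>\n'
    ' - <span class="reviewer"><span class="vcard">'
    '<span class="fn">A. Reviewer</span></span></span>\n'
    '<span title="123">extra</span>\n'
)

match = REVIEW_RE.search(SAMPLE)
fields = dict(zip(FIELD_NAMES, match.groups())) if match else {}
```

Every captured value comes back as a string, so rating and releaseYear would still need converting to numbers before any arithmetic in Python.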

I can now run some interesting queries:

  • * | chart avg(rating) by releaseYear

    This graphs the average rating by release year.
  • * | stats count(title), avg(rating) by artist | search "count(title)">2 | sort "avg(rating)" d | head 10

    This shows the top-rated artists that have at least 3 reviews on Pitchfork.
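
For readers who don't think in search pipelines yet, the second query can be read as four steps: group by artist, keep groups with more than 2 reviews, sort by average rating descending, and take the first 10. Here is a minimal Python sketch of the same logic. The function name top_artists and the sample reviews are invented for illustration, and it assumes every review has both a title and a numeric rating.

```python
from collections import defaultdict


def top_artists(reviews, min_count=2, limit=10):
    """Rough equivalent of: stats ... by artist | search count>2 | sort d | head 10."""
    ratings_by_artist = defaultdict(list)
    for review in reviews:
        ratings_by_artist[review["artist"]].append(review["rating"])

    rows = [
        (artist, len(ratings), sum(ratings) / len(ratings))
        for artist, ratings in ratings_by_artist.items()
        if len(ratings) > min_count          # search "count(title)">2
    ]
    rows.sort(key=lambda row: row[2], reverse=True)  # sort "avg(rating)" d
    return rows[:limit]                               # head 10


# Invented sample data: artist C is rated highest but has only 2 reviews.
reviews = [
    {"artist": "A", "rating": 8.0}, {"artist": "A", "rating": 9.0},
    {"artist": "A", "rating": 7.0},
    {"artist": "B", "rating": 6.0}, {"artist": "B", "rating": 6.0},
    {"artist": "B", "rating": 6.0},
    {"artist": "C", "rating": 10.0}, {"artist": "C", "rating": 10.0},
]

result = top_artists(reviews)
```

With this sample, artist C is filtered out despite its perfect average, which is exactly why the count filter is in the search.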