Hadoop rant

Hadoop’s rise to fame is based on a fundamental optimization principle in computer science: data locality. Translated into Hadoop speak, that would be: move computation to data, not the other way around.

In this post I will rant about one core Hadoop area where this principle is broken (or at least not implemented yet). But before that, I will outline the submission process of a MapReduce job that processes data residing in HDFS:

On the client:
1. gather all the correct confs, user input etc ...
2. contact NameNode to get a list of files that need to be processed
3. generate a list of splits to run Map tasks on, by:
3.1 for each file returned in

» Continue reading

Connecting Splunk and Hadoop

Finally I am getting some time to write about some cool features of one of the projects that I’ve been working on - Splunk Hadoop Connect. This app is our first step in integrating Splunk and Hadoop. In this post I will cover three tips on how this app can help you, all of them based on the new search command included in the app: hdfs. Before diving into the tips, I would encourage you to download, install and configure the app first. I’ve also put together two screencast videos to walk you through the installation process:

Installation and Configuration for Hadoop Connect
Kerberos Configuration
You can also find the full documentation for…

» Continue reading

Got pony – APAC style!

After the pony-fication of the London office we set up a challenge for Eugenia to make our APAC office “complete”. So, she searched everywhere for the best fitting pony, but …. she had a hard time setting her heart on one … so she got two instead :)

I am proud to present Butterpac and Butterbar – APAC’s very own ponies !!!!!

Just in case you thought there was some visual trick that duplicated the ponies

Welcoming Butterbar & Butterpac to the family !

Pony riding pony FTW !!!!

Rumor has it these are no simple ponies: if you press the ear (dunno which one), neighs and galloping sounds will be played for your pleasure.

» Continue reading

Got pony?

Splunk’s UK office now has its very own pony – meet Butternut!

I was visiting our UK office for 2 weeks for partner/support training. This was my first time in London so I found a few things surprising: a) most of the beers served in pubs are flat, wtf? b) the Brits love fried stuff c) the Splunk office was a bit low energy, something was missing. However, the latter all changed when one day, on our way to lunch, Jaleh, one of our coworkers, noticed a pony on display – everyone was super excited and we just had to get it !!! For some reason, HR had to approve this first :)

This is…

» Continue reading

Cannot search based on an extracted field

UPDATE: in 4.3 and later, search-time fields extracted from indexed fields work without any further configuration

In the past couple of days I had to help people from support and professional services troubleshoot the exact same problem twice, so chances are it might be useful for you too ;)

The problem
I have set up a regex-based field extraction, let’s say the field name is MyField. When I run a search, say “sourcetype=MyEvents”, I see that the field is extracted correctly. However, when I run a search based on a value of MyField, say “sourcetype=MyEvents MyField=ValidValue”, nothing gets returned. WTF?

The solution
For the impatient, here’s how to solve this.

$SPLUNK_HOME/etc/system/local/fields.conf
[MyField]
INDEXED_VALUE =

» Continue reading

Storing encrypted credentials

Splunk 4.2 was released today, and here is your new resolution:

Build the greatest Splunk app that gathers data from all kinds of sources, some public and others requiring credentials, indexes it in Splunk and then does some cool things with it.

This blog post is concerned with only one small but important aspect of your great app: how to securely store user credentials yet be able to safely access them in clear text when needed. I will split the post into four sections: getting credentials from the user, accessing them from your script, where the credentials are stored, and security implications.

Get and securely store user credentials

The best time to get user credentials for your app…

» Continue reading

Alert Throttling

NOTE: in 4.2 (released today 3/15/2011) alert suppression/throttling is supported natively by Splunk

Most Splunk users soon realize that Splunk ships with a scheduler which can be used to run searches periodically and execute some actions (send an email, generate an RSS feed, call a script etc) when the results of the search meet some condition. Soon after discovering this feature, many users start looking for some mechanism to throttle the alerts issued by Splunk. For example, a common use pattern for alerts is: check the health of a resource every 5 minutes and send an email alert when the resource is unhealthy, but only send out emails at most every hour. As of the most recent release…
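
To make the pattern above concrete (check every 5 minutes, email at most once an hour), here is a minimal Python sketch of the throttling logic. It is not how Splunk implements suppression; the class name `AlertThrottle` and its method are made up for illustration, and sending the email is left to the caller.

```python
import time


class AlertThrottle:
    """Hypothetical helper: allow an alert only if enough time has passed."""

    def __init__(self, min_interval_seconds):
        self.min_interval_seconds = min_interval_seconds
        self.last_sent = None  # timestamp of the last alert that was let through

    def should_send(self, healthy, now=None):
        """Return True if an alert should go out for this health check."""
        now = time.time() if now is None else now
        if healthy:
            return False  # nothing to alert on
        if (self.last_sent is not None
                and now - self.last_sent < self.min_interval_seconds):
            return False  # still inside the quiet period, suppress
        self.last_sent = now
        return True


# Simulate an unhealthy resource checked every 5 minutes for 2 hours,
# with alerts throttled to at most one per hour.
throttle = AlertThrottle(min_interval_seconds=3600)
sent_at = [t for t in range(0, 7200, 300)
           if throttle.should_send(healthy=False, now=t)]
```

Out of the 24 checks in that simulation, only the ones at the start of each hour let an alert through; the rest are suppressed.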

» Continue reading

Delimiter based KV extraction – advanced

If you’ve read my previous post on delimiter based KV extraction, you might be wondering whether you could do more with it (Anonymous Coward did). Well, yes you can, and I am going to cover the “advanced” cases here. Before covering the capabilities, as in other posts, I will first go over some observations and examples.

Observations

  1. Header-body. Some applications, for different reasons, choose to format their log files using a header and a body section. The header usually describes the way the fields are organized in each logged event, while the body consists of logged events, usually one per line, with field values delimited as described in the header. W3C, CSV etc come to mind, see examples
  2. Single-delimiter.

» Continue reading

Delimiter based key-value pair extraction

As described in my previous post, key-value pair extraction (or more generally structure extraction) is a crucial first step to further data analysis. While automatic extraction is highly desirable, we believe empowering our users with tools to apply their domain knowledge is equally important. To this end, this post introduces one of the simplest forms of key-value pair extractions (KV-extraction) – delimiter based extraction.

Observation

Most logged events usually contain a list of key-value pairs (e.g. attribute list, method call values etc) in a context-dependent well-defined format. An example of a well-defined format: “key-value pairs are separated from each other using ‘;’ while the key is separated from the value using ‘=’”. More generally, well defined attribute…
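
As an illustration of the example format above (pairs separated by ‘;’, key separated from value by ‘=’), here is a minimal Python sketch. It is not Splunk’s implementation; the function name `extract_kv` and the sample event are made up for this example.

```python
def extract_kv(text, pair_delim=";", kv_delim="="):
    """Split text into key-value pairs using two single-character delimiters."""
    fields = {}
    for pair in text.split(pair_delim):
        # partition splits on the first kv_delim only, so values may contain '='
        key, sep, value = pair.partition(kv_delim)
        key = key.strip()
        if not sep or not key:
            continue  # skip chunks that are not well-formed key=value pairs
        fields[key] = value.strip()
    return fields


# Hypothetical event in the ';' / '=' format described above
event = "user=alice; action=login;status=ok"
fields = extract_kv(event)
```

Chunks without a key or without the ‘=’ delimiter are skipped rather than guessed at, which keeps the sketch predictable on messy input.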

» Continue reading

Key-value pair extraction definition, examples and solutions….

Most of the time, logs contain data which humans can easily recognize as either completely or semi-structured information. Being able to extract structure from log data is a necessary first step to further, more interesting, analysis. While it would be great to be able to automatically extract the structure from all log data, Splunk cannot rival the brain’s performance at this time. However, it is able to tap into your brain for help :) Read on ……

Problem definition:

Extract structured information (in key/field=value form) from un/semi-structured log data. Note: for the purpose of this post, key and field are used interchangeably to denote a variable name.

Problem examples:

Splunk debug…

» Continue reading