Using Splunk Archive Bucket Reader with Pig

Update 9/27/16: As of Sept. 27, 2016, Hunk functionality has been incorporated into the Splunk Analytics for Hadoop Add-On and Splunk Enterprise versions 6.5 and later.

This is part II in a series of posts about how to use the Splunk Archive Bucket Reader. For information about installing the app and using it to obtain jar files, please see the first post in this series.

In this post I want to show how to use Pig to read archived Splunk data. Unlike Hive, Pig cannot be directly configured to use InputFormat classes. However, Pig provides a Java interface, LoadFunc, that makes it reasonably easy to use an arbitrary InputFormat with just a small amount of Java code. A LoadFunc is provided with Splunk Archive Bucket Reader: com.splunk.journal.pig.JournalLoadFunc. If you would prefer to write your own, you can find more information here.
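If you do decide to write your own, the core of a LoadFunc is small. The skeleton below is a sketch against Pig's standard org.apache.pig.LoadFunc API; the MyInputFormat class and the way its records are turned into tuple fields are placeholders for illustration, not the add-on's actual implementation.

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Sketch of a custom LoadFunc wrapping a hypothetical MyInputFormat.
public class MyLoadFunc extends LoadFunc {
    private RecordReader reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public void setLocation(String location, Job job) throws IOException {
        // Tell the underlying InputFormat where to read from.
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public InputFormat getInputFormat() {
        // Placeholder: return your InputFormat here.
        return new MyInputFormat();
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null; // end of input
            }
            // Convert the current record into a Pig tuple; the field
            // extraction here is illustrative only.
            Object value = reader.getCurrentValue();
            return tupleFactory.newTuple(Arrays.asList(value.toString()));
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}
```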

Whereas Hive closely resembles a relational database, Pig is more like a high-level imperative language for creating Hadoop jobs: you tell Pig how to build data “relations”, either from raw data or from other relations.

In the following, we’ll assume you already have Pig installed and configured to point to your Hadoop cluster, and that you know how to start an interactive session. If not, you can find more information here.

Here is an example Pig session. The language used is called Pig Latin.

REGISTER splunk-bucket-reader-1.1.h2.jar;
A = LOAD 'journal.gz' USING com.splunk.journal.pig.JournalLoadFunc('host', 'source', '_time') AS (host:chararray, source:chararray, time:long);
B = GROUP A BY host;
C = FOREACH B GENERATE group, COUNT(A);
dump C;

Let’s look at these statements in more detail.

  • First:
    REGISTER splunk-bucket-reader-1.1.h2.jar;

    This statement tells Pig where to find the jar file containing the Splunk-specific classes.

  • Next:
    A = LOAD 'journal.gz' USING com.splunk.journal.pig.JournalLoadFunc('host', 'source', '_time') AS (host:chararray, source:chararray, time:long);

    This statement creates a relation called “A” that contains data loaded from the file ‘journal.gz’ in the user’s HDFS home directory. The expression “(‘host’, ‘source’, ‘_time’)” determines which fields will be loaded from the file, and the expression “AS (host:chararray, source:chararray, time:long)” determines what they will be named in this session and which data types they should be assigned.

  • Next:
    B = GROUP A BY host;
    C = FOREACH B GENERATE group, COUNT(A);

    These statements say that we want to group events (or in Pig-speak, tuples) together based on the “host” field, and then count how many tuples each host has.

  • Finally:
    dump C;

    This tells Pig that we want the results printed to the screen.
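When building up statements like these interactively, Pig's standard DESCRIBE operator (part of Pig Latin itself, not this add-on) is a handy way to check the schema of a relation before running the job:

```pig
-- Print the schema Pig has inferred for relation A
DESCRIBE A;
-- prints something like: A: {host: chararray,source: chararray,time: long}
```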

I ran these commands on a journal file containing data from the “Buttercup Games” tutorial, which you can download from here. They produced these results:

(host::www1,24221)
(host::www2,22595)
(host::www3,22975)
(host::mailsv,9829)
(host::vendor_sales,30244)
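Since dump only prints to the console, for anything beyond a quick check you would typically write results back to HDFS instead. Here is a sketch using Pig's standard ORDER and STORE operators; the output path ‘host_counts’ is an arbitrary choice:

```pig
-- Sort hosts by event count, descending, then write to HDFS as CSV
D = ORDER C BY $1 DESC;
STORE D INTO 'host_counts' USING PigStorage(',');
```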

Voilà! Now you can use Pig with archived Splunk data.