Splunk Archive Bucket Reader and Hive
This year was my first .conf, and it was an amazingly fun experience! During the keynote, we announced a number of new Hunk features, one of which was the Splunk Archive Bucket Reader. This tool lets you read Splunk raw data journal files from any Hadoop application that allows you to configure which InputFormat implementation is used. In particular, if you are using Hunk archiving to copy your indexes onto HDFS, you can now query and analyze the archived data from those indexes using whatever your organization’s favorite Hadoop applications are (e.g. Hive, Pig, Spark). This will hopefully be the first of a series of posts showing in detail how to integrate with these systems. This post covers some general information about using Archive Bucket Reader, and then discusses how to use it with Hive.
Getting to Know the Splunk Archive Bucket Reader
The Archive Bucket Reader is packaged as a Splunk app, and is available for free here.
It provides implementations of Hadoop classes that read Splunk raw data journal files, and make the data available to Hadoop jobs. In particular, it implements an InputFormat and a RecordReader. These will make available any index-time fields contained in a journal file. This usually includes, at a minimum, the original raw text of the event, the host, source, and sourcetype fields, the event timestamp, and the time the event was indexed. It cannot make available search-time fields, as these are not kept in the journal file. More details are available in the online documentation.
Now let’s get started. If you haven’t already, install the app from the link above. If your Hunk user does not have adequate permissions, you may need the assistance of a Hunk administrator for that step.
Log onto Hunk and look at your home screen. You should see a “Bucket Reader” icon on the left side of the screen. Click on it to open a page of documentation.
Take some time and look around this page. There is lots of good information, including how to configure Archive Bucket Reader to get the fields you want.
Next, click on the Downloads tab at the top of the page.
There are two links for downloading the jar file you will need. If you are using Hadoop 2.0 or greater (including any version of YARN), click the second link. Otherwise, click the first link. Either way, your browser will begin downloading the corresponding jar to your computer.
Using Hive with Splunk Archive Bucket Reader
We’ll assume that you already have a working Hive installation. If not, you can find more information about installing and configuring Hive here.
We need to take the jar we downloaded in the last section, and make it available to Hive. It needs to be available both to the local client, and on the Hadoop cluster where our commands will be executed. The easiest way to do this is to use the “auxpath” argument when starting Hive, with the path to the jar file. For example:
hive --auxpath /home/hive/splunk-bucket-reader-2.0beta.jar
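Alternatively, if the Hive CLI is already running, you can add the jar to the current session with an ADD JAR statement, and Hive will distribute it to the cluster for you. The path below is just an example; point it at wherever you saved the jar:

ADD JAR /home/hive/splunk-bucket-reader-2.0beta.jar;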
If you forget this step, you may get class-not-found errors in the following steps. Now let’s create a Hive table backed by a journal.gz file. Enter the following at your Hive command line:
CREATE EXTERNAL TABLE splunk_event_table (
  Time DATE,
  Host STRING,
  Source STRING,
  date_wday STRING,
  date_mday INT
)
ROW FORMAT SERDE 'com.splunk.journal.hive.JournalSerDe'
WITH SERDEPROPERTIES (
  "com.splunk.journal.hadoop.value_format" = "_time,host,source,date_wday,date_mday"
)
STORED AS
  INPUTFORMAT 'com.splunk.journal.hadoop.mapred.JournalInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/hive/user_data';
If this was successful, you should see something like this:
OK
Time taken: 0.595 seconds
Let’s look at a few features of this “create table” statement.
- First of all, note the EXTERNAL keyword in the first line and the LOCATION clause in the last line. EXTERNAL tells Hive to leave the data files in the LOCATION directory in place, and read them when necessary to complete queries. This assumes that /user/hive/user_data contains only journal files. If you want Hive to maintain its own copy of the data, drop the EXTERNAL keyword and the LOCATION clause at the end. Once the table has been created, use a LOAD DATA statement.
- The line
STORED AS INPUTFORMAT 'com.splunk.journal.hadoop.mapred.JournalInputFormat'
tells Hive that we want to use the JournalInputFormat class to read the data files. This class is located in the jar file that we told Hive about when we started the command-line. Note the use of “mapred” instead of “mapreduce”—Hive requires “old-style” Hadoop InputFormat classes, instead of new-style. Both are available in the jar.
- These lines:
ROW FORMAT SERDE 'com.splunk.journal.hive.JournalSerDe' WITH SERDEPROPERTIES ( "com.splunk.journal.hadoop.value_format" = "_time,host,source,date_wday,date_mday" )
tell Hive which fields we want to pull from the journal files to use in the table. See the app documentation for more detail about which fields are available. Note that we are invoking another class from the Archive Bucket Reader jar, JournalSerDe. “SerDe” stands for serializer-deserializer.
- This section:
(Time DATE, Host STRING, Source STRING, date_wday STRING, date_mday INT)
tells Hive how we want the columns to be presented to the user. Note that there are the same number of columns here as in the SERDEPROPERTIES clause. This section could be left out altogether, in which case each field would be treated as a string, and would have the name it has in the journal file, e.g. _time as a string instead of Time as a date.
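To make the managed-table variant from the first bullet concrete: after creating the table without the EXTERNAL keyword or the LOCATION clause, a load might look like the following (the HDFS path is only an example):

LOAD DATA INPATH '/user/hive/staging/journal.gz' INTO TABLE splunk_event_table;

Note that LOAD DATA INPATH moves the file into Hive’s warehouse directory rather than copying it, so the original path will no longer contain the file afterward.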
Now that you have a Hive table backed by a Splunk journal file, let’s practice using it. Try the following queries:
select * from splunk_event_table limit 10;
select host, count(*) from splunk_event_table group by host;
select min(time) from splunk_event_table;
Hopefully that’s enough to get you started. Happy analyzing!