Faster and limitless Hunk archiving to S3 with Hadoop 2.6.0

We’ve learned that Hunk can archive Splunk buckets to HDFS and S3. In this post we’ll see how to use the new S3 integration introduced in Apache Hadoop 2.6.0 to get better performance and avoid the 5GB limit on the size of an uploaded file.

Edit: While the title says “limitless”, the actual limit for a single object in S3 is currently 5TB.

The new S3 filesystem – S3A

Apache Hadoop 2.6.0 incorporates a new S3 filesystem implementation, named S3A, which has better performance and supports uploads larger than 5GB. You use it with Hadoop by giving your paths an s3a prefix, like so: s3a://<bucket>/<path>. This should be familiar if you’ve used Hadoop with S3 before, where you prefixed paths with s3:// or s3n://.
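For a rough sense of why the 5GB ceiling goes away: S3 accepts at most 5GB in a single PUT, but a multipart upload can split one object into as many as 10,000 parts. As far as I understand, that is the mechanism S3A relies on for large files. The Python sketch below only does the arithmetic. The 8GB file and the 100MB part size are illustrative assumptions, not values taken from this setup, and parts_needed is a hypothetical helper, not part of Hadoop or Hunk.

```python
GB = 1024 ** 3
MB = 1024 ** 2

SINGLE_PUT_LIMIT = 5 * GB        # largest object S3 accepts in a single PUT
MAX_PARTS = 10_000               # S3 limit on parts per multipart upload
MAX_OBJECT_SIZE = 5 * 1024 * GB  # 5TB, the largest single S3 object


def parts_needed(file_size, part_size):
    """Return how many parts a multipart upload of file_size bytes needs."""
    if file_size > MAX_OBJECT_SIZE:
        raise ValueError("file exceeds the 5TB S3 object limit")
    # Integer ceiling division; even a tiny file needs one part.
    parts = max(1, (file_size + part_size - 1) // part_size)
    if parts > MAX_PARTS:
        raise ValueError("part size too small: more than 10,000 parts needed")
    return parts


file_size = 8 * GB    # assumed size, too big for a single PUT
part_size = 100 * MB  # assumed part size

print(file_size > SINGLE_PUT_LIMIT)        # True
print(parts_needed(file_size, part_size))  # 82
```

The same arithmetic shows that the part size has to grow with the file: a 5TB object cannot be uploaded in 100MB parts, because that would need more than 10,000 of them.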

Using S3A with Hunk

To use the S3A filesystem with Hunk, you need to create a Hunk provider that points to Hadoop client libraries that include the S3AFileSystem Java class, such as those in Apache Hadoop 2.6.0. Here’s how I configured a provider to use the S3A filesystem:

  1. Hadoop Home: /absolute/path/to/apache/hadoop-2.6.0
  2. vix.fs.s3a.access.key: <AWS access key>
  3. vix.fs.s3a.secret.key: <AWS secret key>
  4. vix.env.HADOOP_TOOLS: $HADOOP_HOME/share/hadoop/tools/lib
  5. vix.splunk.jars: $HADOOP_TOOLS/hadoop-aws-2.6.0.jar,$HADOOP_TOOLS/aws-java-sdk-1.7.4.jar,$HADOOP_TOOLS/jackson-databind-2.2.3.jar,$HADOOP_TOOLS/jackson-core-2.2.3.jar,$HADOOP_TOOLS/jackson-annotations-2.2.3.jar

Configuration details

Depending on your Hadoop distribution, you might not have to include extra jars in your classpath (steps 4 and 5). We have to do it when using Apache Hadoop 2.6.0 because S3AFileSystem.class and its dependencies are not included in the default classpath. Also note that the location and versions of the jars containing S3AFileSystem.class and its dependencies may be different in your Hadoop distribution.

Another cool configuration trick is that you can define custom environment variables (step 4) to make the configuration nice and generic. The variable can have any valid environment variable name; it doesn’t have to be “HADOOP_TOOLS”. This trick doesn’t work for all configuration values, but it does work for vix.splunk.jars.
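To make the effect of that variable concrete, here is a small Python sketch that expands $HADOOP_TOOLS the way a shell would and prints the jar paths that result. It only mimics the substitution for illustration; it is not how Hunk implements it, and expand and expand_jars are hypothetical helpers. The jar list is shortened to two entries from step 5.

```python
from string import Template


def expand(value, env, max_rounds=10):
    """Expand $NAME references, including variables that reference others."""
    for _ in range(max_rounds):
        expanded = Template(value).safe_substitute(env)
        if expanded == value:  # nothing left to substitute
            break
        value = expanded
    return value


def expand_jars(jars_value, env):
    """Split a comma-separated jar list and expand each entry."""
    return [expand(jar.strip(), env) for jar in jars_value.split(",")]


# Values mirror steps 1 and 4 of the provider configuration.
env = {
    "HADOOP_HOME": "/absolute/path/to/apache/hadoop-2.6.0",
    "HADOOP_TOOLS": "$HADOOP_HOME/share/hadoop/tools/lib",
}

jars = expand_jars(
    "$HADOOP_TOOLS/hadoop-aws-2.6.0.jar,$HADOOP_TOOLS/aws-java-sdk-1.7.4.jar",
    env,
)

for jar in jars:
    print(jar)
```

Because every jar path hangs off one variable, pointing the provider at a distribution that keeps its jars somewhere else means changing a single value rather than five paths.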

Archive Indexes and Virtual Indexes

Now we can create new virtual indexes and archive indexes, or change our existing ones, to use our newly configured provider, and set their paths with the s3a prefix, like so: s3a://<bucket>/<path>. And just like that we’re searching and archiving data more efficiently!
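If you have a handful of existing paths to switch over, the change is just the prefix. The Python sketch below rewrites s3n:// paths to s3a:// and leaves everything else alone. The helper name to_s3a and the example bucket and path are made up for illustration. It deliberately skips s3:// paths, on the assumption that those may point at the older block-based S3 filesystem, which stores data differently.

```python
def to_s3a(path):
    """Rewrite an s3n:// path to s3a://; return any other path unchanged."""
    old_prefix = "s3n://"
    if path.startswith(old_prefix):
        return "s3a://" + path[len(old_prefix):]
    return path


# Hypothetical example paths.
print(to_s3a("s3n://my-bucket/hunk/archive"))
print(to_s3a("hdfs://namenode:8020/hunk/archive"))
```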

More performance and no more scary upload limits, just like that :)