Using Hadoop RecordReaders with Hunk
Hunk can process any data format that has a RecordReader, also known as a pre-processor. In previous posts, we showed you how to use pre-processors to search image data with Hunk and how to write your own RecordReader. In this post, you’ll learn how to use existing Hadoop RecordReaders with Hunk, without any modifications!
Hunk’s Hadoop RecordReader requirements
The prerequisites for using Hadoop RecordReaders with Hunk are:
- The RecordReader has a constructor that takes no arguments.
- The RecordReader uses RecordReader.initialize(InputSplit, TaskAttemptContext) for initialization.
- The .toString() method of the value object that your RecordReader returns from .getCurrentValue() returns a valid data representation that Splunk understands. In other words, the value class should override java.lang.Object’s default implementation of .toString(). You could, for example, return a JSON string.
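To make the third prerequisite concrete, here is a minimal sketch of a value class whose .toString() returns a JSON object. AwesomeRecord and its put() method are hypothetical names made up for this illustration, and the hand-rolled escaping is only meant to keep the example dependency-free; a real RecordReader would more likely use a JSON library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical value type that a RecordReader could return from getCurrentValue().
final class AwesomeRecord {
    private final Map<String, String> fields = new LinkedHashMap<>();

    AwesomeRecord put(String key, String value) {
        fields.put(key, value);
        return this;
    }

    // Overrides Object.toString() so the record renders as one JSON object.
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            first = false;
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
        }
        return sb.append('}').toString();
    }

    // Minimal escaping: backslash, double quote, and control characters.
    private static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Without the override, .toString() would fall back to the java.lang.Object default (class name plus hash code), which carries none of the record's data.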
A RecordReader is configured by specifying the Java class and a regex. The RecordReader is used for all files that match the specified regex. You can configure RecordReaders for virtual indexes or for providers. If you configure a RecordReader on a provider, it is enabled for all virtual indexes that use that provider. If you configure a RecordReader on a virtual index, it is enabled only for that virtual index. Here’s how you would configure a RecordReader with class com.company.bigdata.AwesomeFileRecordReader, which matches files ending in .azm. The first two lines configure it on a virtual index (vix.input.1.*), and the last two configure it on a provider (vix.splunk.search.*):
vix.input.1.recordreader = com.company.bigdata.AwesomeFileRecordReader
vix.input.1.recordreader.AwesomeFileRecordReader.regex = \.azm$
vix.splunk.search.recordreader = com.company.bigdata.AwesomeFileRecordReader
vix.splunk.search.recordreader.AwesomeFileRecordReader.regex = \.azm$
When configuring a RecordReader on the provider, you’ll notice that vix.splunk.search.recordreader already contains RecordReaders, so you’ll want to append your class to this comma-separated list. You’ll also notice that these RecordReaders are not specified with their full package names. This is because we prepend com.splunk.mr.input. to RecordReaders that have no package, which keeps our own configuration shorter.
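The name expansion described above can be sketched as follows. This is not Hunk's actual code: ReaderClassNames is a hypothetical helper, SomeBuiltInRecordReader in the usage note is a made-up placeholder, and detecting "no package" by the absence of a dot is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class ReaderClassNames {
    // Package that is prepended to names configured without a package.
    static final String DEFAULT_PACKAGE = "com.splunk.mr.input.";

    // Assumption: "no package" is detected by the absence of a dot.
    static String resolve(String configured) {
        String name = configured.trim();
        return name.contains(".") ? name : DEFAULT_PACKAGE + name;
    }

    // Splits a comma-separated property value and resolves every entry.
    static List<String> resolveAll(String propertyValue) {
        List<String> resolved = new ArrayList<>();
        for (String entry : propertyValue.split(",")) {
            if (!entry.trim().isEmpty()) {
                resolved.add(resolve(entry));
            }
        }
        return resolved;
    }
}
```

With this sketch, a list such as SomeBuiltInRecordReader,com.company.bigdata.AwesomeFileRecordReader resolves to com.splunk.mr.input.SomeBuiltInRecordReader followed by your own class, unchanged.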
RecordReader configuration identifiers
We need an identifier for the RecordReader to configure the regex. For BaseSplunkRecordReaders, we use the BaseSplunkRecordReader.getName() method to get this identifier. Normal Hadoop RecordReaders don’t have a .getName() method, so we decided to use Java’s Class.getSimpleName() instead. If your RecordReader is an inner class, use just the simple name of the RecordReader class as the identifier, without the outer class name. For example, for an imaginary RecordReader BazRecordReader, which happens to be an inner class of com.company.bigdata.FooBar, you’d configure your RecordReader on the virtual index as:
vix.input.1.recordreader = com.company.bigdata.FooBar$BazRecordReader
vix.input.1.recordreader.BazRecordReader.regex = <your-regex>
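If you’re unsure which string goes where, plain Java can tell you. In the sketch below, FooBar and BazRecordReader are empty stand-ins for the imaginary classes above (declared without a package to keep the example short), and IdentifierDemo is a hypothetical helper.

```java
// Stand-in for the imaginary outer class com.company.bigdata.FooBar.
class FooBar {
    // Stand-in for the nested RecordReader implementation.
    static class BazRecordReader {
    }
}

class IdentifierDemo {
    // Binary class name, with '$' separating outer and nested class.
    // This is the form used as the vix.input.1.recordreader value.
    static String binaryName() {
        return FooBar.BazRecordReader.class.getName();
    }

    // Simple name, without package or outer class.
    // This is the identifier used in the regex property key.
    static String simpleName() {
        return FooBar.BazRecordReader.class.getSimpleName();
    }
}
```

Here simpleName() yields BazRecordReader, while binaryName() ends in FooBar$BazRecordReader, matching the two configuration lines above.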
RecordReader matching order
For every file, Hunk uses the first RecordReader whose regex matches the file’s path. RecordReaders configured on the virtual index are tried before the RecordReaders specified on the provider. Both the vix.input.1.recordreader and vix.splunk.search.recordreader configurations are comma-separated lists, and we test the RecordReaders in the order in which they appear in each list.
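The matching order can be modeled with a few lines of Java. This is only an illustration of the rule, not Hunk’s implementation: ReaderRule and ReaderSelector are hypothetical names, and the use of an unanchored regex search (Matcher.find()) is an assumption based on patterns like \.azm$ matching whole paths.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical model of one configured RecordReader: class name plus regex.
final class ReaderRule {
    final String className;
    final Pattern regex;

    ReaderRule(String className, String regex) {
        this.className = className;
        this.regex = Pattern.compile(regex);
    }
}

final class ReaderSelector {
    // Returns the class name of the first rule whose regex matches the path,
    // or null when nothing matches. Virtual index rules are tried first,
    // then provider rules, each in their configured order.
    static String select(String path, List<ReaderRule> indexRules, List<ReaderRule> providerRules) {
        List<ReaderRule> ordered = new ArrayList<>(indexRules);
        ordered.addAll(providerRules);
        for (ReaderRule rule : ordered) {
            // Assumption: unanchored search, so a suffix pattern can match a full path.
            if (rule.regex.matcher(path).find()) {
                return rule.className;
            }
        }
        return null;
    }
}
```

One practical consequence: if a broad regex appears early in a list, it shadows more specific RecordReaders that come after it, so order your lists from most to least specific.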
Get more performance by using com.splunk.mr.input.BaseSplunkRecordReader
Consider creating a new RecordReader wrapper that extends BaseSplunkRecordReader when either:
- Your RecordReader does not meet the prerequisites for working with Hunk (see section on prerequisites).
- You want more performance out of your RecordReader.
One way to get more performance out of your RecordReader is by using BaseSplunkRecordReader.serializeCurrentValueTo(OutputStream). Using this method to stream string bytes to the OutputStream, instead of returning a Text object from .getCurrentValue(), can save you byte-array copies and other object allocations.
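To show the idea without depending on Hunk’s classes, here is a standalone sketch. StreamingRecordSketch is hypothetical and does not extend the real BaseSplunkRecordReader; the method name mirrors serializeCurrentValueTo(OutputStream), but the return type and throws clause are assumptions, so check them against the actual class before overriding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in; it does not extend the real BaseSplunkRecordReader.
final class StreamingRecordSketch {
    private byte[] buffer = new byte[0];
    private int offset;
    private int length;

    // Points at the current record inside a shared read buffer, without copying.
    void setCurrent(byte[] buffer, int offset, int length) {
        this.buffer = buffer;
        this.offset = offset;
        this.length = length;
    }

    // Streams the record's bytes directly; no String or Text is allocated.
    void serializeCurrentValueTo(OutputStream out) throws IOException {
        out.write(buffer, offset, length);
    }

    // Helper for trying the sketch out: captures what would be streamed.
    static String render(byte[] data, int offset, int length) {
        StreamingRecordSketch reader = new StreamingRecordSketch();
        reader.setCurrent(data, offset, length);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            reader.serializeCurrentValueTo(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

The point of the design is that the record never leaves the read buffer until it is written: only an offset and a length change per record, whereas building a Text or String per record allocates and copies every time.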
Another way to get more performance is to use the list of required fields, which is available in BaseSplunkRecordReader as a protected field named _requiredFields. The fields in this list are the only fields that Hunk is going to use from your output. The list may contain "*", which means that any of the fields you output may be needed. It may also contain just a few specific fields, in which case you can choose to output only those fields, so that less time is spent parsing the output passed from your RecordReader to Hunk.
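A filter driven by the required fields could look roughly like this. The sketch assumes that the required fields can be treated as a collection of field-name strings and that, apart from "*", entries are exact names; the actual type of _requiredFields should be checked in BaseSplunkRecordReader. RequiredFieldsFilter is a hypothetical helper.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

final class RequiredFieldsFilter {
    // Keeps only the fields the search needs; "*" means keep everything.
    static Map<String, String> filter(Map<String, String> record, Collection<String> requiredFields) {
        if (requiredFields.contains("*")) {
            return record;
        }
        Map<String, String> kept = new LinkedHashMap<>();
        for (String name : requiredFields) {
            String value = record.get(name);
            // Fields the record does not have are simply skipped.
            if (value != null) {
                kept.put(name, value);
            }
        }
        return kept;
    }
}
```

Even better than filtering after the fact is to skip parsing the unneeded fields in the first place, when your file format allows it.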
Here’s a list of examples where the required fields list will contain "*". You can think of these as searches whose results include the raw data:
- index=vix source=*myfile.ext
- index=vix foo bar baz=5
Here’s a list of examples where the required fields list will contain a few specific fields, which gives you an opportunity to filter your RecordReader output. Think of these as searches whose results are tables:
- index=vix | stats count by source
- index=vix | table foo, bar
- index=vix | fields foo, bar
I’ve gotten orders of magnitude better performance by utilizing the required fields.
I hope this was helpful for you and that you’ll try your existing RecordReaders with Hunk. I also recommend that you invest some time into extending the BaseSplunkRecordReader for more performance once you’ve gotten your RecordReader working.