Using Hadoop RecordReaders with Hunk
Hunk can process any data format that has a RecordReader, a.k.a. a pre-processor. In previous posts, we showed you how to use pre-processors to search image data with Hunk and how you can write your own RecordReader. In this post, you’ll learn how to use existing Hadoop RecordReaders with Hunk, without any modifications!
Hunk’s Hadoop RecordReader requirements
The prerequisites for using Hadoop RecordReaders with Hunk are:
- The RecordReader has a constructor that takes no arguments.
- The RecordReader uses RecordReader.initialize(InputSplit, TaskAttemptContext) for initialization.
- The .toString() method of the value object that your RecordReader returns from .getCurrentValue() returns a valid data representation that Splunk understands, i.e. you need to override java.lang.Object’s default implementation.
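The third requirement is the easiest one to miss. As a minimal sketch, assuming a hypothetical value class named ColorSummary (it is not part of Hunk or Hadoop), a toString() override that emits space-separated key=value pairs could look like this. The key=value layout is only one reasonable choice of text representation, picked here because Splunk can usually extract such pairs as fields at search time.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical value type that a RecordReader could return from getCurrentValue().
public class ColorSummary {
    private final String fileName;
    // LinkedHashMap keeps the colors in insertion order, so the output is stable.
    private final Map<String, Integer> colorCounts = new LinkedHashMap<>();

    public ColorSummary(String fileName) {
        this.fileName = fileName;
    }

    public ColorSummary add(String color, int count) {
        colorCounts.put(color, count);
        return this;
    }

    // Without this override, Object.toString() would return the class name
    // followed by a hash code, which carries no searchable data.
    @Override
    public String toString() {
        StringJoiner joiner = new StringJoiner(" ");
        joiner.add("file=\"" + fileName + "\"");
        for (Map.Entry<String, Integer> entry : colorCounts.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }
}
```

A RecordReader whose getCurrentValue() returns such an object satisfies the third prerequisite, because the text form of each record is produced by the value itself.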
Faster and limitless Hunk archiving to S3 with Hadoop 2.6.0
We’ve learned that Hunk can archive Splunk buckets to HDFS and S3. In this post we’ll see how to use the new S3 integration introduced in Apache Hadoop 2.6.0 to get better performance and avoid the 5GB file size upload limit.
Edit: While the title states “limitless”, the actual limit of a single file object in S3 is currently 5TB.
The new S3 filesystem – S3A
Apache Hadoop 2.6.0 incorporates a new S3 filesystem implementation which has better performance and supports uploads larger than 5GB. The new S3 filesystem is named S3A. It is used with Hadoop by configuring your paths with an s3a prefix like so: s3a://<bucket>/<path>. This should be familiar to you if …
New in Hunk 6.2.1: Splunk Archiving & Searchable Archives!
- Archive your existing Splunk indexer’s data with Hunk 6.2.1
- Search archived data in place from the Hunk search head
- Documentation here!
Archive Splunk Data
Hunk 6.2.1 enables you to continuously archive your Splunk data to Hadoop by pointing a Hunk search head to your Splunk indexers and configuring new Archive Indexes.
Searching archived data
You can search archived data in place on Hadoop just as easily as you would search any other Splunk index. There’s no need to move data more than once. This works because Hunk already knows how to efficiently search data in Hadoop. We just had to archive the data in a file structure such that Hunk could efficiently prune the data by time.
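To illustrate why a time-aware layout helps, here is a minimal sketch of pruning by time. It assumes that each archived bucket directory carries its time range in its name, following Splunk’s db_<latest>_<earliest>_<id> bucket naming convention; whether the archive uses exactly this layout is an assumption, and BucketPruner and its methods are hypothetical names, not Hunk APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: keeps only buckets whose time range overlaps the search range.
public class BucketPruner {

    // Assumed name format: db_<latestEpochSeconds>_<earliestEpochSeconds>_<id>.
    public static boolean overlaps(String bucketName, long searchEarliest, long searchLatest) {
        String[] parts = bucketName.split("_");
        if (parts.length < 4 || !parts[0].equals("db")) {
            return true; // Unknown naming: keep the bucket rather than risk losing data.
        }
        long latest;
        long earliest;
        try {
            latest = Long.parseLong(parts[1]);
            earliest = Long.parseLong(parts[2]);
        } catch (NumberFormatException e) {
            return true; // Unparseable times: keep the bucket.
        }
        // Two closed intervals overlap when each starts before the other ends.
        return earliest <= searchLatest && latest >= searchEarliest;
    }

    public static List<String> prune(List<String> bucketNames, long searchEarliest, long searchLatest) {
        List<String> kept = new ArrayList<>();
        for (String name : bucketNames) {
            if (overlaps(name, searchEarliest, searchLatest)) {
                kept.add(name);
            }
        }
        return kept;
    }
}
```

Because the decision uses only directory names, buckets outside the search window can be skipped without opening any of their files.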
Hunk Preprocessors: How to DIY
In the previous blog post on image searching with Splunk, I showed you how you can preprocess data with Hunk to get the ability to Splunk any data. This blog post is all about how to do it yourself.
Before we start, here are links to the code for the image preprocessor demo:
The first link has all the preprocessor code and the second link has the code for making the sweet image UI. You can look at it before, while and/or after reading the rest of the blog post. Enjoy!
A Hunk preprocessor is basically just a Hadoop RecordReader<K, V>, where K is irrelevant and V is Text. We provide a base class that …
Image Search with Splunk and Hunk
One of the sexy new features Hunk brings to the Splunk 6 smorgasbord is preprocessing data. Since Hunk is built on top of Hadoop’s MapReduce framework, we can utilize its preprocessing framework. Basically, now you can take any data, write a piece of code that turns it into text, then search where it is stored!
Update: Code is open sourced here!
I’ve created a demo where you can select colors and get images that match the selection. It looks like this:
Image searching in Splunk? How is this possible? Indexing images?
Indexing images, no. Preprocessing at search time. There are no indexing costs.
I do this by searching a set of images stored on HDFS; my preprocessor extracts the color distribution …
Splunking for the homeless
Preparing data for Splunk
#!/bin/bash
craigslist_search="http://sfbay.craigslist.org/search/apa/sfc?zoomToPosting=&query=&srchType=A&minAsk=2000&maxAsk=3500&bedrooms=2&nh=4&nh=11&nh=10&nh=18&nh=29"
curl -s $craigslist_search | \
sgrep -o "%r\n" -i…
Splunk @ jQuery developer summit 2012
The jQuery developer summit is an annual opportunity to meet the entire jQuery team and learn how you can get involved in the jQuery community. They tell you how each jQuery project works, what’s being done and how you can contribute to the projects.
Splunk sponsored the event and has given the jQuery foundation a non-profit Splunk license.
I went to this event together with four awesome Splunkers from the San Francisco office all the way to AOL in Dulles, Virginia.
jQuery treated us and all the other attendees to food, beer and a chair, from Sunday to Tuesday. It was a very good time and we met a lot of very nice people.
Before the event …
Splunkgit – Github just got Splunked! (Part 4/4)
This is the fourth and last part in a four-part series where Petter and Emre cover their Splunk app, Splunkgit. The Splunk app is available for download on splunkbase here, and it is also on github here.
Usages of the git tab
Splunkgit – Github just got Splunked! (Part 2/4)
This is the second part in a four-part series where Petter and Emre cover their Splunk app, Splunkgit. The Splunk app is available for download on splunkbase here, and it is also on github here. You can find part 1 here if you missed it.
Who am I?
Hello there, blog reader! As this is my first post, I will do what I am told and introduce myself. My name is Petter Eriksson and I study computer science at the Royal Institute of Technology, Stockholm, Sweden. I am an intern here at Splunk for 6 months, where I will be doing software development and also my master’s thesis. Being an intern here at Splunk has been great so …