Splunking Continuous REST Data

One of the ways vendors expose machine data is via REST. There are a couple of ways to get REST data into Splunk today:

  1. Use Damien Dallimore’s REST API Modular Input – you can provide a custom response handler for this input to persist state.
  2. Use the new Splunk Add-on Builder – this method does a “one shot” of the REST endpoint, meaning that every time the input runs, it gets all the data again.

In this post, I will show you how to implement a cursor mechanism (i.e. pick up where you left off last time) for REST endpoints that continually have new data over time using the checkpoint mechanism built into modular inputs.

The Data Source

For this example, we will ingest JSON data from a Tumblr blog – http://ponidoodles.tumblr.com.  I chose this one because the v1 REST endpoint in Tumblr is open and easy to use (no authentication required).  Plus, this one is about ponies.

The API documentation and parameters can be found here: https://www.tumblr.com/docs/en/api/v1

We will use 2 of the available parameters:

  • start – the post offset at which to start pulling posts
  • num – the number of posts to pull
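
As a minimal sketch of how these two parameters end up in a request URL: the helper name build_url is hypothetical, and the /api/read/json path is an assumption based on the v1 API documentation linked above.

```python
from urllib.parse import urlencode


def build_url(rest_url, start, num):
    """Append the start and num query parameters to rest_url."""
    return "%s?%s" % (rest_url, urlencode({"start": start, "num": num}))


# Assumed endpoint path; check the API documentation for your blog.
url = build_url("http://ponidoodles.tumblr.com/api/read/json", 0, 5)
print(url)
```

Changing start while keeping num fixed is what lets each run ask only for posts it has not seen yet.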

Getting the Data In

Following is the pseudo-code we will use to get the data:

  1. Get the starting position from a checkpoint
  2. If there is no checkpoint, set the starting position to 0
  3. Pull up to 5 posts from the endpoint starting at the starting position
  4. Count the number of posts read
  5. Stream each post to Splunk
  6. Add the number of posts read to the starting position
  7. Save the new starting position (in the first case, the new starting position will be 5)
  8. Repeat
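
The steps above can be sketched without Splunk or a network connection. In this illustration the checkpoint is a plain dict and the endpoint is a list; fetch_posts and run_once are hypothetical names, not part of the Splunk SDK.

```python
def fetch_posts(all_posts, start, num):
    """Stand-in for the REST call: up to num posts beginning at offset start."""
    return all_posts[start:start + num]


def run_once(all_posts, checkpoint, streamed, num=5):
    """One run of the input; returns the number of posts read."""
    start = checkpoint.get("last_position", 0)        # steps 1 and 2
    posts = fetch_posts(all_posts, start, num)        # step 3
    for post in posts:                                # step 5
        streamed.append(post)
    checkpoint["last_position"] = start + len(posts)  # steps 4, 6 and 7
    return len(posts)


blog = ["post-%d" % i for i in range(12)]  # fake data source with 12 posts
checkpoint, streamed = {}, []
counts = [run_once(blog, checkpoint, streamed) for _ in range(4)]  # step 8
print(counts, checkpoint)  # [5, 5, 2, 0] {'last_position': 12}
```

Once the source is exhausted, a run reads zero posts and leaves the checkpoint unchanged, so nothing is streamed twice.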

To keep the code concise, we will use the Splunk Python SDK to create a modular input.

In the Splunk Python SDK, all the magic happens in the stream_events method.

To implement the checkpoint mechanism based on the pseudo-code above, I borrowed some code from the Splunk Add-on Builder to abstract the checkpointing mechanics.
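
To give a feel for what such an abstraction does, here is a minimal file-backed sketch. SimpleFileStateStore is a hypothetical class written for this illustration; it is not the Add-on Builder's implementation, and in a real modular input the directory would typically come from the checkpoint directory that Splunk passes to the script.

```python
import json
import os
import tempfile


class SimpleFileStateStore:
    """Hypothetical checkpoint helper: one JSON file per key."""

    def __init__(self, checkpoint_dir):
        self.checkpoint_dir = checkpoint_dir

    def _path(self, key):
        return os.path.join(self.checkpoint_dir, key)

    def get_state(self, key):
        """Return the saved value for key, or None if there is no checkpoint."""
        try:
            with open(self._path(key)) as f:
                return json.load(f)
        except (IOError, ValueError):
            return None

    def update_state(self, key, value):
        """Write to a temp file, then rename it over the old checkpoint."""
        fd, tmp = tempfile.mkstemp(dir=self.checkpoint_dir)
        with os.fdopen(fd, "w") as f:
            json.dump(value, f)
        os.replace(tmp, self._path(key))


demo_dir = tempfile.mkdtemp()  # throwaway directory for the demo
store = SimpleFileStateStore(demo_dir)
first = store.get_state("last_position") or 0
store.update_state("last_position", first + 5)
print(first, store.get_state("last_position"))  # 0 5
```

Writing to a temporary file and renaming it means a reader sees either the old value or the new one, never a partially written file.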

Here is an actual code snippet:

state_store = FileStateStore(inputs.metadata, self.input_name)
last_position = state_store.get_state("last_position") or 0
self.url = "%s?start=%s&num=%s" % (rest_url, str(last_position), str(num))
http_cli = httplib2.Http(timeout=10, disable_ssl_certificate_validation=True)
resp, content = http_cli.request(self.url, method=self.rest_method, body=urllib.urlencode(self.data), headers=self.header)
jsVariable = content.decode('utf-8', errors='ignore')
# The response from this particular REST endpoint delivers content in a JavaScript variable.
#   - Example:  tumblr_api_read = {"key":value, "key":value, "posts":[array of posts]}
#
# The line below strips out the unnecessary text to get just the JSON
jsonValue = json.loads('{%s}' % (jsVariable.split('{', 1)[1].rsplit('}', 1)[0],))
num_posts_streamed = 0
for post in jsonValue["posts"]:
    num_posts_streamed += 1
    # Stream the event here
# Store the position to pick up on next time
last_position = int(last_position) + num_posts_streamed
state_store.update_state("last_position", str(last_position))

The complete code can be found on GitHub.
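
The wrapper-stripping step in the snippet can be tried on its own. The sketch below uses a made-up sample string in the shape the code comments describe; js_variable_to_json is a hypothetical helper name, and the approach assumes the response contains exactly one top-level object literal.

```python
import json


def js_variable_to_json(js_text):
    """Parse the object literal out of text like 'var name = {...};'."""
    start = js_text.index("{")   # first opening brace
    end = js_text.rindex("}")    # last closing brace
    return json.loads(js_text[start:end + 1])


# Made-up sample, not real output from the endpoint.
sample = 'var tumblr_api_read = {"posts-start": 0, "posts": [{"id": "1"}, {"id": "2"}]};'
parsed = js_variable_to_json(sample)
print(len(parsed["posts"]))  # 2
```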

Note: The method we used here for saving a checkpoint is very basic (i.e. counting the number of posts) and may not apply to your situation.  Sometimes, the REST data may give you a continuation token and something like the following may be necessary:

if "nextLink" in jsonValue:
    state_store.update_state("nextLink", jsonValue["nextLink"])

Microsoft Azure Audit does this, for instance.
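
A continuation-token checkpoint might look roughly like the sketch below. Everything here is illustrative: the pages dict stands in for the HTTP endpoint, the "value" key and the link strings are made up, and fetch_next_page is a hypothetical name. What to do when a response carries no nextLink depends on the API and is left out.

```python
def fetch_next_page(pages, state, first_link):
    """Fetch one page, resuming from the saved token if there is one."""
    link = state.get("nextLink", first_link)
    page = pages[link]  # stand-in for the HTTP request
    if "nextLink" in page:
        # Save the token so the next run starts where this one stopped.
        state["nextLink"] = page["nextLink"]
    return page["value"]


pages = {
    "/events": {"value": ["a", "b"], "nextLink": "/events?token=2"},
    "/events?token=2": {"value": ["c"], "nextLink": "/events?token=3"},
    "/events?token=3": {"value": ["d"]},
}
state = {}
first_run = fetch_next_page(pages, state, "/events")
second_run = fetch_next_page(pages, state, "/events")
print(first_run, second_run, state)
```

Unlike the offset counter, the token is opaque: the input stores whatever the endpoint hands back and replays it on the next request.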

Testing the Input

A nice way to test your input prior to using it in your Splunk environment is to use the Splunk CLI.  First, copy the contents of the GitHub repo above to your $SPLUNK_HOME/etc/apps folder.  Next, execute the following (Splunk is installed in /opt/splunk in this case):

/opt/splunk/bin/splunk cmd splunkd print-modinput-config splunk_rest_example splunk_rest_example://RESTTest | /opt/splunk/bin/splunk cmd python /opt/splunk/etc/apps/TA_rest-example/bin/splunk_rest_example.py

The checkpoint location is

$SPLUNK_HOME/var/lib/splunk/modinputs/splunk_rest_example

There will be a file in there called last_position that gets updated on each run.  Open it up with a text editor to see for yourself.

Clearing Input Data

If you want to reset the checkpoint file, run the following command:

/opt/splunk/bin/splunk clean inputdata splunk_rest_example

Note: you can also clean eventdata to remove indexed data.  For testing purposes, I usually write events to a staging index (this is done via inputs.conf) and clean that index as needed.

Distributed Deployments

All the code and examples above were run on a single Splunk instance.  If you plan on using these techniques in a distributed deployment, the recommended architecture is to run the input on a heavy forwarder.  For more information about where to install add-ons in a distributed deployment, check the Splunk documentation.

Incorrect about the REST Mod Input. It is used most frequently in continuous scenarios, not just “one shot”. To accomplish this you simply declare a custom response handler for your REST input.

Check out this Twitter response handler: https://github.com/damiendallimore/SplunkModularInputsPythonFramework/blob/master/implementations/rest/bin/responsehandlers.py#L145

Essentially it is keeping track of the “since_id” and this gets used in subsequent requests and is automagically persisted back to inputs.conf (i.e. to survive restarts etc.)

Damien Dallimore
May 12, 2016