Eureka! Extracting key-value pairs from JSON fields

With the rise of HEC (and with our new Splunk logging driver), we’re seeing more and more of you, our beloved Splunk customers, pushing JSON over the wire to your Splunk instances. One common question we’re hearing is: how can key-value pairs be extracted from fields within the JSON? For example, imagine you send an event like this:

{"event":{"name":"test", "payload":"foo=bar\r\nbar=\"bar bar\"\tboo.baz=boo.baz.baz"}}

This event has two fields, name and payload. Looking at the payload field, however, you can see that it contains additional fields embedded as key-value pairs. Splunk will automatically extract name and payload, but it will not look further into payload to extract the fields within. That is, not unless we tell it to.
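To make the gap concrete, here is a minimal Python sketch of what plain JSON parsing yields for the event above. Python is used purely for illustration and is not how Splunk parses events; the variable names are placeholders. The two top-level fields come out, while the key-value pairs stay buried inside a single string.

```python
import json

# The inner "event" object from the example above, as it appears on the wire.
RAW_EVENT = r'{"name":"test", "payload":"foo=bar\r\nbar=\"bar bar\"\tboo.baz=boo.baz.baz"}'

fields = json.loads(RAW_EVENT)

# JSON parsing only sees the two top-level fields.
print(sorted(fields))  # ['name', 'payload']

# The key-value pairs are still just text inside one string value.
print(repr(fields["payload"]))
```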

Field Extractions to the rescue

Splunk allows you to specify additional field extractions at index or search time which can extract fields from the raw payload of an event (_raw). Thanks to its powerful support for regexes, we can use some regex FU (kudos to Dritan Btincka for the help here on an ultra compact regex!) to extract KVPs from the “payload” specified above.

Setup

To specify the extractions, we will define a new sourcetype httpevent_kvp in %SPLUNK_HOME%/etc/system/local/props.conf by adding the entries below. This regex uses negated character classes to specify the keys and values to match on. If you are not a regex guru, that last statement might have made you pop a blood vessel :-)

[httpevent_kvp]
KV_MODE=json
EXTRACT-KVPS = (?:\\[rnt]|:")(?<_KEY_1>[^="\\]+)=(?:\\")?(?<_VAL_1>[^="\\]+)
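If the regex is hard to read, it can help to run it outside Splunk first. The sketch below is a rough Python port, not the Splunk implementation: Python spells named groups as (?P<name>...), and the group names key and val are placeholders standing in for _KEY_1 and _VAL_1. It assumes the raw event text contains the escapes as literal backslash sequences. Reading the pattern left to right: a match starts after an escaped \r, \n or \t, or after :" (the start of a JSON string value); the key is a run of characters other than =, " and \; then comes =, an optional escaped quote, and the value, which uses the same negated character class.

```python
import re

# Python port of the EXTRACT-KVPS regex (group names are placeholders).
KVP_RE = re.compile(r'(?:\\[rnt]|:")(?P<key>[^="\\]+)=(?:\\")?(?P<val>[^="\\]+)')

# Assumed raw text of the event: \r, \n, \t and \" are literal
# backslash sequences here, which is what the regex keys on.
RAW_EVENT = r'{"name":"test", "payload":"foo=bar\r\nbar=\"bar bar\"\tboo.baz=boo.baz.baz"}'


def extract_kvps(raw):
    """Return (key, value) pairs in the order the regex finds them."""
    return [(m.group("key"), m.group("val")) for m in KVP_RE.finditer(raw)]


for key, val in extract_kvps(RAW_EVENT):
    print(f"{key} = {val}")
# foo = bar
# bar = bar bar
# boo.baz = boo.baz.baz
```

Note that name is not picked up as a key-value pair, because "test" is not followed by an equals sign.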

Next, configure your HEC token to use the sourcetype httpevent_kvp. Alternatively, you can set sourcetype in your JSON when you send your event.
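For the second option, the sourcetype key sits next to event in the JSON body rather than inside it. Here is a small Python sketch of building such a body; the helper name build_hec_body is made up for illustration. It uses compact separators so that string values follow :" directly, as the regex above expects. Whether extra whitespace would survive into _raw is not something verified here.

```python
import json


def build_hec_body(event, sourcetype=None):
    """Build a JSON body for HEC, optionally carrying a per-event sourcetype."""
    envelope = {"event": event}
    if sourcetype is not None:
        # "sourcetype" is a sibling of "event", not a field inside it.
        envelope["sourcetype"] = sourcetype
    # Compact separators: no space between ':' and the opening quote.
    return json.dumps(envelope, separators=(",", ":"))


body = build_hec_body(
    {"name": "test", "payload": 'foo=bar\r\nbar="bar bar"\tboo.baz=boo.baz.baz'},
    sourcetype="httpevent_kvp",
)
print(body)
```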

Restart your Splunk instance, and you are ready to test.

Testing it out

We’ll use curl to test if the new sourcetype is working.

curl -k https://localhost:8088/services/collector \
  -H 'Authorization: Splunk 16229CD8-BB6B-449E-BA84-86F9232AC3BC' \
  -d '{"event":{"name":"test", "payload":"foo=bar\r\nbar=\"bar bar\"\tboo.baz=boo.baz.baz"}}'

Heading to Splunk, we can see that the foo, bar and boo.baz fields were properly extracted as interesting fields.

[Image: KVP interesting fields]

Now heading to “All Fields” we can select each of the new fields.

[Image: KVP select fields]

And then see the values magically show up!

[Image: KVP fields]

Considerations

There are a few things to consider when using this approach.

  • It is not free. The extractions will run for every search against the sourcetype they are defined for. You’ll want to measure the performance impact to ensure the degradation is acceptable for your patterns of querying. You may be able to refine these regexes further to limit the amount of matching, which will help. Alternatively, using index-time extractions will minimize the search-time impact, but will slow down data ingest and increase the storage / license hit.
  • The regexes included here may or may not work depending on your payloads, and they may have to be tweaked. You’ll need to test to ensure that fields are being extracted properly. For example, with the current regex, if a key is sent like “ foo” with a leading space after the quote, Splunk will extract the field name with the leading space.
  • The approach is brittle as it depends on clients sending data in a format that is compatible with the regexes. You may have to tweak the regexes over time if the format changes or new types of data appear. Unless you have a good idea of the kind of data that is being sent, it may not work for you.

In short, make sure you test.
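One cheap way to test is to run sample payloads through the regex before touching props.conf. The Python sketch below reproduces the leading-space case from the list above and tries one possible tweak. The \s* after the prefix is a hypothetical adjustment, not part of the original configuration, and it has only been exercised in Python here, not in Splunk. The sample payload and helper name are likewise made up.

```python
import re

# Python port of the regex from props.conf (group names are placeholders).
ORIGINAL = re.compile(r'(?:\\[rnt]|:")(?P<key>[^="\\]+)=(?:\\")?(?P<val>[^="\\]+)')

# Hypothetical tweak: skip any whitespace right after the prefix.
TWEAKED = re.compile(r'(?:\\[rnt]|:")\s*(?P<key>[^="\\]+)=(?:\\")?(?P<val>[^="\\]+)')

# Made-up sample where every key has a leading space.
SAMPLE = r'{"name":"test","payload":" foo=bar\r\n baz=qux"}'


def extract(pattern, raw):
    """Return (key, value) pairs found by the given compiled pattern."""
    return [(m.group("key"), m.group("val")) for m in pattern.finditer(raw)]


print(extract(ORIGINAL, SAMPLE))  # [(' foo', 'bar'), (' baz', 'qux')]
print(extract(TWEAKED, SAMPLE))   # [('foo', 'bar'), ('baz', 'qux')]
```

A handful of such samples, drawn from what your clients really send, makes it easier to notice when a format change breaks the extraction.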

Summary

This approach gives you a way to extract KVPs residing within the values of your JSON fields. This is useful when using our Docker Log driver, and for general cases where you are sending JSON to Splunk.

In the future, hopefully we will support extracting from field values out of the box. In the meantime, this may work for you.

Note: Special thanks to Martin Müller who provided tweaks to the regexes to improve performance and for his suggestions in the considerations section.