
January 03, 2013


Predicting Missing Data

By Splunk

Teach Splunk to predict missing field values in your data! With the brand new Splunk Predict App, you can predict and fill in the value of missing fields in your data, using training sets that have those values. This app builds Naive Bayes models to predict field values. In some test sets, the model predicted values correctly more than 99.95% of the time.

From customers that fill out their gender, you can predict the gender of customers that have not, perhaps based on writing style, word choice, or other features.
From events that list a host name, you can predict the host name for events that are missing it.
From customers that explain why they unsubscribed from a mailing list, predict why others left even if they didn’t say why.

If you have the actual field value in question, compare the predicted value against the actual value to determine whether the value is unexpected. Does the event’s data look like it belongs in this source of data, or is it suspicious?
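The comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not part of the app: the helper `flag_unexpected` and the field names `host` and `predicted_host` are hypothetical, and the events are assumed to be plain dictionaries that already carry a predicted value.

```python
def flag_unexpected(events, actual_field, predicted_field):
    """Return events whose actual value disagrees with the predicted one."""
    return [
        event
        for event in events
        # Only events carrying both values can be compared.
        if actual_field in event and predicted_field in event
        and event[actual_field] != event[predicted_field]
    ]


# Hypothetical events: the second one claims a host the model did not expect.
events = [
    {"host": "web01", "predicted_host": "web01"},
    {"host": "web02", "predicted_host": "db01"},
    {"predicted_host": "web01"},  # no actual value, so nothing to compare
]

suspicious = flag_unexpected(events, "host", "predicted_host")
```

A mismatch does not prove that an event is bad. It only marks the event as worth a closer look.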

Suppose you have a dataset with missing or questionable values. You can now predict the missing values based on other values. For example, in human-entered data or social media data (e.g., Twitter), imagine predicting political or demographic information based on ZIP code, first name, salary, etc. Alternatively, you may have one dataset that has a field filled out and another dataset where that field is missing or sporadic.

Lastly, you can use the Predict app for sentiment analysis. For example, you can take a small training set of emails, each marked up with “angry=10” or “angry=1”, and have the app learn to recognize angry emails. Angry emails can then be routed directly to a manager.

App Details

This app includes four search commands:

train to train the model to predict a field value
guess to fill in missing field values
reset to delete a trained model
icluster to cluster data based on its information similarity. For example, are two emails written by the same user, using different accounts?

For details on the parameters of each of these commands, use typeahead, which shows all the defaults. Make sure to click More in the typeahead instructions.

Examples

For example, to learn gender from names, you might train it with:

gender=* | fields name, gender | train name2gender from gender

If you don’t limit the fields to “name” and “gender”, it will use all fields to predict gender. If you have an inkling of which fields can predict other fields, limit the fields; otherwise, don’t bother and it will figure it out.

You can have it predict “gender” for events that don’t have a gender field specified.

* | guess name2gender into gender
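To give a feel for what a train step and a guess step might do conceptually, here is a minimal Naive Bayes sketch in Python. It is not the app's implementation, which this post does not show. The functions `train`, `guess`, and `fill_missing` are hypothetical helpers named after the commands, events are assumed to be plain dictionaries, each field=value pair is treated as one categorical feature, and the add-one smoothing is a choice made for this sketch.

```python
import math
from collections import Counter, defaultdict


def train(events, target):
    """Count how often each (field, value) feature occurs with each target value."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for event in events:
        if target not in event:
            continue  # only events that carry the target field can teach the model
        label = event[target]
        class_counts[label] += 1
        for field, value in event.items():
            if field != target:
                feature_counts[label][(field, value)] += 1
                vocabulary.add((field, value))
    return {"classes": class_counts, "features": feature_counts, "vocab": vocabulary}


def guess(model, event, target):
    """Return the most probable target value for one event."""
    total = sum(model["classes"].values())
    # Add-one smoothing; the extra +1 reserves a slot for features never seen in training.
    slots = len(model["vocab"]) + 1
    best_label, best_score = None, -math.inf
    for label, count in model["classes"].items():
        seen = model["features"][label]
        denominator = sum(seen.values()) + slots
        score = math.log(count / total)  # log prior
        for field, value in event.items():
            if field != target:
                score += math.log((seen[(field, value)] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def fill_missing(model, events, target):
    """Copy events, filling in the target field only where it is absent."""
    return [
        event if target in event else {**event, target: guess(model, event, target)}
        for event in events
    ]


# Hypothetical training data: names with a known gender field.
training = [
    {"name": "alice", "gender": "f"},
    {"name": "bob", "gender": "m"},
    {"name": "alice", "gender": "f"},
    {"name": "carol", "gender": "f"},
    {"name": "dave", "gender": "m"},
]
model = train(training, "gender")
filled = fill_missing(model, [{"name": "bob"}, {"name": "alice"}], "gender")
```

Treating a whole name as a single feature keeps the sketch short. A practical model would likely split values into smaller features, such as words or character sequences, so that it can generalize to names it has not seen.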

As another example, predict the sourcetype from the _raw text of events. First, train a model:

index=_internal | train getsrctype from sourcetype

Then use that model to guess sourcetypes and compare the guesses to the real sourcetype values to measure accuracy:

index=_internal | rename sourcetype as real_sourcetype | fields real_sourcetype | guess getsrctype into sourcetype | top sourcetype,real_sourcetype
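The search above counts each combination of guessed and real sourcetype. As a rough illustration of how those counts turn into a single accuracy figure, here is a small Python sketch. The helpers `pair_counts` and `accuracy` and the sample events are hypothetical, and the events are assumed to be dictionaries holding both fields.

```python
from collections import Counter


def pair_counts(events, predicted_field, real_field):
    """Count each (predicted, real) combination across the events."""
    return Counter(
        (event[predicted_field], event[real_field])
        for event in events
        if predicted_field in event and real_field in event
    )


def accuracy(counts):
    """Fraction of events whose predicted value equals the real value."""
    total = sum(counts.values())
    correct = sum(n for (predicted, real), n in counts.items() if predicted == real)
    return correct / total if total else 0.0


# Hypothetical results: three correct guesses and one wrong one.
events = [
    {"sourcetype": "splunkd", "real_sourcetype": "splunkd"},
    {"sourcetype": "splunkd", "real_sourcetype": "splunkd"},
    {"sourcetype": "splunkd", "real_sourcetype": "splunk_web_access"},
    {"sourcetype": "scheduler", "real_sourcetype": "scheduler"},
]

counts = pair_counts(events, "sourcetype", "real_sourcetype")
score = accuracy(counts)
```

Rows where the two values differ show which sourcetypes the model confuses with each other, which is often more useful than the overall score.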

----------------------------------------------------
Thanks!
David Carasso
