david: api

Anomalies: How to find what you’re looking for, without looking for it

Very often you want to find “problems” in your IT data, but you don’t know what to look for. How can you find these problems with Splunk?

In Splunk’s new search language, there are several search operators that can help you. I’ll describe only a subset of what is possible.

  • 1) You can search for unexpected events by looking at those that do not cluster into large groups. For example, you can cluster the errors in the last hour and report on the events that belong to the smallest clusters (e.g., ‘error | cluster showcount=true | sort cluster_count | head 5’).
  • 2) You can find unexpected events by finding field values that are statistically far from the norm (for a numeric field, far from the mean as measured in standard deviations). For example, you can search for sendmail events with anomalous ‘delay’ values (e.g., ‘sourcetype=sendmail_syslog | anomalousvalue delay action=filter pthresh=0.02’).
  • 3) You can use machine learning to find events whose values are unexpected given the historical context (e.g., ‘* | anomalies blacklist=boringevents’).
  • 4) It’s a little bit of a hand-wave, but you can build graphical reports that often make anomalies visually obvious. For example, you could create a timechart of average cpu_seconds by host and see problems at a glance (e.g., ‘sourcetype=top | timechart avg(cpu_seconds) by host’).
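
To make the intuition behind the first two ideas concrete, here is a minimal Python sketch. It is not how Splunk implements ‘cluster’ or ‘anomalousvalue’; the function names (rare_signatures, far_from_mean), the digit-masking signature, and the z-score threshold are all assumptions made for illustration only.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def rare_signatures(events, top=5):
    """Group events by a crude signature and return the smallest groups first.

    Hypothetical stand-in for clustering: the signature simply masks digit
    runs, so lines that differ only in numbers land in the same group.
    """
    def signature(line):
        return re.sub(r"\d+", "N", line)

    counts = Counter(signature(e) for e in events)
    # Ascending by count, so the rarest signatures come first.
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:top]


def far_from_mean(values, z_thresh=3.0):
    """Return values more than z_thresh standard deviations from the mean.

    Uses the population standard deviation; z_thresh is an assumed cutoff,
    not a Splunk parameter.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing stands out.
        return []
    return [v for v in values if abs(v - mu) / sigma > z_thresh]
```

For example, you might call rare_signatures on the last hour of error lines to surface one-off messages, and far_from_mean on a list of ‘delay’ values to pick out the stragglers. Both are deliberately naive, but they show why "small groups" and "far from the mean" are useful definitions of "unexpected" when you don’t know what you are looking for.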