Writing Actionable Alerts

Is your Splunk environment spamming you? Do you have so many alerts that you no longer see through the noise? Do you fear that your Splunk is losing its purpose and value because users have no choice but to ignore it?

I’ve been there. I inherited a system like that. And what follows is an evolution of how I matured those alerts from spam to saviors.

Let it be known that Splunk does contain a number of awesome search commands to help with anomaly detection. If you enjoy what you read here, be sure to check them out since they may simplify similar efforts. http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Commandsbycategory#Find_anomalies

Stage 1: Messages of Concern

Some of the first alerts created are going to be searches for a specific string like the word “error”. Maybe that search runs every five minutes and emails a distribution list when it finds results. Guess what? With a search like that, you’ve just made a giant leap into the world of spammy searches. Congratulations…?!

Stage 2: Thresholds

Ok, so maybe alerting every time a message appears is too much. The next step is often to strengthen that search by looking for some number of occurrences of the message over some time period. Maybe I want to be alerted if this error occurs more than 20 times in a five minute window.

Unfortunately, even with this approach you’ll soon hit your “threshold” (pun) for threshold-based searches. What you didn’t know is that you tuned this search during a low part of the day and the 20 times is too low for peak activity. The next logical step is to increase the thresholds to something like 40 errors every five minutes to accommodate the peak and ignore the low periods when there’s less customer impact. Not ideal but still an improvement.

Unfortunately, you’re still driving blind. Over time, if business is bad, you may stop seeing alerts simply because the threshold of 40 errors every five minutes is unattainable for the lower customer usage of the system. In that case, you still have issues but you’re ignoring them! What about the inverse? If business is good, usage of the system should pick up and a threshold of 40 errors every five minutes will generate spam. In both scenarios you’re not truly seeing the relative impact of the error to the overall activity on the system during that time period.
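To make that blind spot concrete, here is a minimal sketch in plain Python (not SPL, and not the search I actually ran). The window numbers are made up for illustration; only the threshold of 40 comes from the text above.

```python
# Illustration only: plain Python standing in for a scheduled Splunk search.
THRESHOLD = 40  # errors per five minute window, the number used above

def fixed_threshold_alert(error_count, threshold=THRESHOLD):
    """Alert when a window's error count exceeds a static threshold."""
    return error_count > threshold

# Hypothetical five minute windows (made-up numbers).
quiet_window = {"total_events": 100, "errors": 30}     # 30% of traffic is failing
busy_window = {"total_events": 10_000, "errors": 50}   # 0.5% of traffic is failing

quiet_alert = fixed_threshold_alert(quiet_window["errors"])  # no alert, despite 30% errors
busy_alert = fixed_threshold_alert(busy_window["errors"])    # alert, despite 0.5% errors
```

The quiet window is arguably the worse situation, yet it is the busy one that pages you.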

Stage 3: Relative Percentages

Fortunately, you’ve realized the oversight you’ve made. So you embrace the stats command to do some eval-based statistics to only alert you when the errors you want are more than some percentage (let’s say, 50%) of the overall event activity during that time window. Wow! Amazing improvement. Less spam and you are now seeing spikes only! Good job!
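As a rough sketch of the idea, again in plain Python rather than SPL: the function name and the choice to stay quiet on an empty window are my assumptions, and the 50% comes from the example above.

```python
def error_ratio_alert(error_count, total_count, max_ratio=0.5):
    """Alert when errors are more than max_ratio of all events in the window."""
    if total_count == 0:
        # Assumption: a window with no events at all is not alert-worthy here.
        return False
    return error_count / total_count > max_ratio

# Hypothetical windows: the same check now scales with overall activity.
quiet_alert = error_ratio_alert(error_count=60, total_count=100)     # 60% errors
busy_alert = error_ratio_alert(error_count=50, total_count=10_000)   # 0.5% errors
```

Here the quiet window alerts and the busy one does not, which is the opposite of what a fixed count of 40 would do with the same numbers.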

Of course, we can do even better. Who says that it isn’t normal for the errors to be more than that 50%? Now we’re back at Stage 2 and tweaking relative thresholds based on our perception of what is a “normal” percentage of errors for this system. If you’re with me so far, then buckle up, because we’re about to move into the “behavioral” part of the discussion.

Stage 4: Average Errors

You want to calculate what is normal for your data. You decide that using the average number of errors over some time period is your way to go. So you now use timechart to determine the average number of errors over a larger time window and compare that result to the number of errors in the current time window. If the current time window has more, then you alert.

Your search probably looks fancy by now. You may have even implemented a summary index or some acceleration and switched to the tstats command to facilitate the historical average. If you didn’t know Splunk’s search language yet, you’re definitely learning it now.

There’s just one little catch: when we compare against the average, we are going to trigger an alert about half the time simply because of what an average is. With the median it is half the time by definition, since half of the historical windows sit above it. With the mean it is roughly half unless the data is heavily skewed. That’s no good because the goal here is to reduce the spam and instead we just created a dynamic threshold that is still alerting us about half of all time periods.
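You can convince yourself of this with a few lines of plain Python. The error counts below are made up, and the helper name is mine.

```python
from statistics import mean, median

# Hypothetical error counts for ten historical five minute windows.
history = [12, 15, 9, 22, 18, 11, 30, 14, 16, 13]

def fraction_above(values, threshold):
    """Share of windows whose count is strictly greater than the threshold."""
    return sum(value > threshold for value in values) / len(values)

above_median = fraction_above(history, median(history))  # half of the windows
above_mean = fraction_above(history, mean(history))      # depends on the skew
```

With these numbers half of the windows sit above the median, and three out of ten sit above the mean because one big window drags the mean upward. Either way, that is far too many alerts.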

Stage 5: Percentiles

Now, if you don’t remember percentiles from grade school math, fear not! I didn’t either and I simply read the Wikipedia entry and got it pretty quickly. I’ll try to explain it here in the context of our challenge but I’m sure there are a bazillion pages on the interwebs where it’s better described.

So far, our search is looking at the number of times our error appears over the last five minutes. It compares that quantity against the average we’ve seen for all prior five minute windows, and, if the current five minute window is larger than the average, an alert is triggered.

Now what if, instead of the average, we wanted to alert only when the current five minute window’s count of errors is larger than the maximum count of such errors ever seen within a five minute window? That would be cool because we’d know if we’ve spiked higher than ever before! But what if we already know our historical record has some wild spikes that are too high to compare against? If we assume a normal distribution (silly math stuff), then we can ignore the top five percent of high values. So if we took all the historical values we had and compared against everything except the highest five percent of those historical snapshots, we would be talking about the 95th percentile.

Update: My data science hero, Dr. Tom LaGatta, pointed out some very important assumptions in the above paragraph that, once you’re aware of them, will make you an even stronger Alerting Superhero. Be careful when assuming a normal distribution with your IT & Security data. In fact, if you play with the data in Splunk you can explore it (hint: bin or bucket commands) and see the true distribution. Dr. LaGatta notes that you’re even likely to see heavy-tailed distributions where much of the activity may occur in that top five percent (in the above scenario). Therefore, explore the data and consider what you feel gives you a good handle around the problem at hand. Even if you follow the approach discussed in this post, at least you know the truth around the assumptions you’ve made and won’t be confused about the results.
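If you want a feel for what that exploration amounts to, here is a tiny plain-Python sketch of binning counts into a histogram. It is not the Splunk bin command, just the same idea, and the history values are invented.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count how many values fall into each bin, keyed by the bin's lower edge."""
    return Counter((value // bin_width) * bin_width for value in values)

# Hypothetical error counts for twenty historical five minute windows.
history = [12, 15, 9, 22, 18, 11, 30, 14, 16, 13,
           10, 17, 19, 21, 8, 95, 20, 23, 25, 7]

bins = histogram(history, bin_width=10)
for lower_edge in sorted(bins):
    # One '#' per window, so the shape of the distribution is visible at a glance.
    print(f"{lower_edge:>3}-{lower_edge + 9:<3} {'#' * bins[lower_edge]}")
```

Most windows land in the 10-19 bin, with one lonely window way out at 90-99. That long tail is exactly the kind of thing worth knowing about before you pick a threshold.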

Did I lose you there? Let’s try another approach: When I compare the current five minute window’s count of errors against the historical snapshots of prior five minute windows, for the 95th percentile, I know that if I were to order all the historical results, 95% of the results are lower than the 95th percentile and only five percent of the results are higher.
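Here is that ordering idea as a minimal plain-Python sketch. I’m using the simple “nearest rank” definition of a percentile, which is my choice for illustration; Splunk’s own percentile functions may interpolate or approximate, so their numbers can differ slightly. The history values are invented.

```python
import math

def percentile_nearest_rank(values, pct):
    """Smallest value that has at least pct percent of all values at or below it."""
    ordered = sorted(values)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]

# Hypothetical error counts for twenty historical five minute windows.
history = [12, 15, 9, 22, 18, 11, 30, 14, 16, 13,
           10, 17, 19, 21, 8, 95, 20, 23, 25, 7]

p95 = percentile_nearest_rank(history, 95)

def percentile_alert(current_count, threshold=p95):
    """Alert only when the current window exceeds the historical 95th percentile."""
    return current_count > threshold

calm_alert = percentile_alert(28)   # below the threshold, stay quiet
spike_alert = percentile_alert(41)  # above the threshold, alert
```

Notice that the one wild window of 95 errors does not set the bar. The threshold lands on the next highest value instead, so a single historical freak-out does not silence every future alert.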

If you’re still not with me, see if Siri can help. It’s cool. I’ll wait ’cause I realize this is getting confusing….

…waiting….

Cool! You’re back! Thanks for coming back! And you got me a coffee? Wow, you’re so sweet!

Ok, so where I was going with this is that we now have, instead of the average, a higher threshold to compare against. That means that based on historical observations, we will only alert when the current window’s error count has gone over that higher threshold (like 95th percentile). Why did I pick 95th percentile? Let me tell you about something I call The Lasso Approach.

The Lasso Approach

What I call the Lasso Approach is a triage strategy I created for getting a perimeter around an error, or as I think of it, getting a lasso around your problem.

For an especially spammy system, set some high thresholds. Something like the 95th percentile for that given error. By adjusting your searches to this, you will only be alerted roughly five percent of the time. That means you will only be alerted to the most flagrant issues.

Fix those bad actor bugs causing all those nasty alerts. Eventually the 95th percentile is too generous and you’ll lower it to the 75th percentile (also known as the third quartile). With the lower threshold will come more alerts. Fix those, rinse and repeat. Eventually you discover that you’ve triaged the system by cleaning up the most common errors first. Soon your machine data and email inbox will be readable because the spam of errors is gone!
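To see how tightening the lasso changes the alert volume, here is a small plain-Python sketch. It uses a nearest-rank percentile (my simplification, not necessarily how Splunk computes it) and replays invented history against each threshold.

```python
import math

def percentile_nearest_rank(values, pct):
    """Smallest value that has at least pct percent of all values at or below it."""
    ordered = sorted(values)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]

def windows_that_would_alert(history, pct):
    """Historical windows whose count exceeds the pct-th percentile of that history."""
    threshold = percentile_nearest_rank(history, pct)
    return [count for count in history if count > threshold]

# Hypothetical error counts for twenty historical five minute windows.
history = [12, 15, 9, 22, 18, 11, 30, 14, 16, 13,
           10, 17, 19, 21, 8, 95, 20, 23, 25, 7]

at_95 = windows_that_would_alert(history, 95)  # the loose lasso
at_75 = windows_that_would_alert(history, 75)  # the tighter lasso
```

With this data the 95th percentile flags one window out of twenty, while the 75th flags five. Same data, tighter lasso, more to fix.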

BONUS Stage 6: IT Service Intelligence

I would be remiss if I didn’t mention the next step in this alerting story: IT Service Intelligence (or IT SI for short). IT SI takes alerting a step further in three ways (does that make this three steps further?):

  • Thresholds
  • Time Policies
  • Adaptive Thresholding

Let’s talk about them each real quick.

Thresholds add functionality so instead of the binar-ity (made up a word?) of alert or no alert, a given search (or KPI in this context) can be any number of statuses such as Healthy, Unhealthy, Sick, Critical, or Offline.

Time Policies build on the challenge earlier in this post about busy hours versus quiet hours. It makes perfect sense to have different thresholds for different times of year, day vs night, or peak vs non-peak. That’s what Time Policies give you.
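The underlying idea is simple enough to sketch in plain Python. To be clear, this is not how IT SI is configured or implemented; the policy table, hours, and thresholds below are all invented for illustration.

```python
from datetime import datetime

# Hypothetical policy table: (start_hour, end_hour, error threshold per window).
TIME_POLICIES = [
    (0, 8, 10),    # overnight: little traffic, so a low bar
    (8, 18, 40),   # business hours: peak traffic, so a higher bar
    (18, 24, 20),  # evening
]

def threshold_for(timestamp):
    """Look up the threshold whose hour range contains the timestamp."""
    for start_hour, end_hour, threshold in TIME_POLICIES:
        if start_hour <= timestamp.hour < end_hour:
            return threshold
    raise ValueError("no policy covers this hour")

def time_policy_alert(error_count, timestamp):
    """Alert when the count exceeds the threshold for that time of day."""
    return error_count > threshold_for(timestamp)

# The same 25 errors mean different things at 3 AM and at 11 AM.
night_alert = time_policy_alert(25, datetime(2024, 1, 15, 3, 0))
day_alert = time_policy_alert(25, datetime(2024, 1, 15, 11, 0))
```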

Lastly, Adaptive Thresholding builds on what we did with the percentile work. Within IT SI any KPI can be configured with static thresholds or thresholds that are indicative of the behavior of that data set over time. That means your thresholds adjust as your business adjusts – all in an easy to use UI.

Here’s some documentation on all those features in case you want to see what it looks like. ITSI has so much more in it, but this blog entry has already gone on long enough. http://docs.splunk.com/Documentation/ITSI/latest/Configure/HowtocreateKPIsearches#Create_a_time_policy_with_adaptive_thresholding

Stage 7: Actionable Alerts

This isn’t so much a stage like the prior ones. Merely a call out for you to enjoy the fact that your Splunk system now only alerts when something truly needs attention, i.e. Actionable Alerts.

Congratulations and get ready for your promotion.