erik: Archive for August, 2007

$SPLUNK_HOME

For the first few years it was in garages and basements. Then we graduated to squatting with friends ( thank you Six Apart, Boulder Ventures, and Sevin Rosen ). Finally we scored our own space in SOMA - just across from the PacBell/ATT/Verizon/TMobile/Comcast Park.

… 4th and 5th floor in the taller of the buildings …
street

Why SF?
Some of us live in the north bay to Santa Rosa and beyond.
Some of us live out in the east bay out to Walnut Creek and beyond.
And of course some of us folks live down in Cupertino, MtV, and Sunnyvale.

Our space is nicer than we deserve - bad omen or not - E*TRADE bought the building during the height of the boom and decked it out with $17M in TIs (tenant improvements).
Then before they could move in they got adjusted - along with most of us.

… can’t really tell here but it’s a nice space …
inside
Above pic is our patch on the 4th floor, where we keep it dark.
4th floor is all dev: no lights, lots of coffee, lots of booze (need to post pic of liquor cabinet and kegs), foosball all the time, Wii all the time, bad jokes, etc. - all serious productivity enhancers.

Splunking the most abundant time based dataset on the planet

What is the most abundant time-based data set that *everyone* works with?

It ain’t logs - it’s email.

If you think about it, email messages are a bit “event like” - they have a timestamp, a somewhat structured header, and a payload.
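To make the “event like” idea concrete, here is a minimal sketch in Python (not how Splunk actually indexes mail) that pulls those three pieces out of a raw message. The sample message, the addresses, and the `email_to_event` helper are all made up for illustration.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical notification email, invented for this example.
RAW = """\
From: jira@example.com
To: erik@example.com
Subject: [BUG-123] splunkd crash on startup
Date: Wed, 15 Aug 2007 10:30:00 -0700

A new bug was filed.
"""

def email_to_event(raw):
    """Split a raw (non-multipart) email into timestamp, headers, and payload."""
    msg = message_from_string(raw)
    return {
        # The Date header plays the role of the event timestamp.
        "_time": parsedate_to_datetime(msg["Date"]),
        # Headers are the "somewhat structured" key/value part.
        "headers": {name.lower(): value for name, value in msg.items()},
        # The body is the free-form payload.
        "payload": msg.get_payload().strip(),
    }

event = email_to_event(RAW)
print(event["_time"].isoformat(), event["headers"]["subject"])
```

The header dictionary is what makes reporting by sender, subject, and so on possible once the message is treated as an event.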

Since Splunk was designed for time-based datasets it’s only natural that we hook it up to email. I’m not suggesting that you use Splunk as your mail reader ( although I’m working on a few actions for forward, reply, etc ) but that in a datacenter, email often carries critical workflow information.

In our own infrastructure we have systems generating email notifications for things like support cases, changes to source code, open bugs, etc. It’s interesting to bring the mail into the mix with my logs, config changes, etc. Once my mail is indexed I can instantly report on frequency of customer issues (support case email), changes in source code by file/user (Perforce checkin/diff email), coded bugs by user per week (Jira bug notification emails), or just report on my own inbox - messages by size by time/sender/etc.

Ripping multiline events at search time

I realized that as part of the previous monitoring bundle post I forgot to explain something cool/critical.

When we first conceived of the scripted inputs we used ps, top, and netstat as examples. It was going to be so easy and cool to eat ps output and get graphs of VM usage by process. Totally obvious until we tried it. The ps output in Splunk works best as one event, with the header at the top and a repeated line per process:

posout

Looks great! I can search for “sourcetype::ps splunkd” and get back all the times splunkd was running. But the problem comes when I want to report on VM usage. How do I get our kv extractor to support a search for “average VSZ for splunkd by time”? In our search language ( or using the UI ) you can say something like:

sourcetype::ps splunkd | stats avg(VSZ) by _time

What we want is to produce a table and graph of the average value for the key “VSZ” over time for just the one process, “splunkd”. But the above won’t do that, as there is no “VSZ=” in the “event” and, worse than that, there are many values drawn out in the VSZ column.
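To picture what “ripping” the event would have to do, here is a rough sketch in plain Python. It is not Splunk’s search-time mechanism, just the general idea: use the header line as key names, turn each process line into its own key/value row, and only then average VSZ for splunkd. The sample ps output and the helper names `rip_rows` and `avg_field` are made up for illustration.

```python
# Hypothetical ps output as it would sit in one multiline event.
PS_EVENT = """\
USER       PID  %CPU %MEM    VSZ   RSS COMMAND
root         1   0.0  0.1   2064   640 init
splunk    4321   2.5  3.0 150000 62000 splunkd
splunk    4400   0.5  1.0 100000 20000 splunkd
root      5000   0.0  0.2   5000  1200 sshd
"""

def rip_rows(event_text):
    """Turn one multiline ps event into a list of per-process dicts."""
    lines = event_text.strip().splitlines()
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        # Limit the split so a command containing spaces stays in one field.
        fields = line.split(None, len(header) - 1)
        rows.append(dict(zip(header, fields)))
    return rows

def avg_field(rows, field, command):
    """Average a numeric column over the rows whose COMMAND matches."""
    values = [float(r[field]) for r in rows if command in r["COMMAND"]]
    return sum(values) / len(values) if values else None

rows = rip_rows(PS_EVENT)
avg_vsz = avg_field(rows, "VSZ", "splunkd")
print(avg_vsz)
```

Run once per ps event, this gives one average per timestamp, which is the shape the “avg(VSZ) by _time” report needs.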

single point of failure

Many businesses work hard building out redundant / fault tolerant systems.
We are no exception - we buy lots of hardware.
We use many critical business systems - CRM, SFA, bug/case tracking, source control, etc… we got them all.

But there remains, unfortunately, one hardware device that is a single point of failure - and it’s our most critical hardware component.
When it goes down, development stops, tempers flare, work stops.

We need to get a second warm standby, backup battery power ( someone named Deep turned it off one day ), and a 24×7 support contract ( it’s made in Italy so not sure how that works ).

Here is a picture of our most important system ( need a rack mount for it )
Ummmmmm……..
coffee
( thanks Johnvey Hwang - Mr. Coffee )

popeness - Splunk’s all you can eat for $5.99

When most folks think of Splunk - they think of our log file search engine ( and of course our ads starring Mark, our honest-to-god support/sysadmin guru, and ya, the cool teeshirts, etc ).

But I don’t really use Splunk for logs that much. Don’t get me wrong, logs are useful when indexed, but I like to feed Splunk with lots of other stuff.

In particular, I go after things like email messages ( not the logs, the mail itself ), OS resource info, raw network traffic, and configuration files, just to name a few - so that I can, as we say around the office, “Splunk the Datacenter”.

I find that logs by themselves are useful, but when combined with other information such as historical snapshots of vmstat, iostat, ps, top, etc AND when eating all the configs on the box - then I have everything I need to figure out what is going on.

tagline mindfull

Splunk’s Chief Mind, Mr. David Carasso, is (in)famous around the office. Partly for his brilliant ( rather clever ) software algorithms but perhaps more notoriously for his ability to create ingenious “tag lines”.

EandD
(guess which one is David and which one is Me)

Splunk owes a good chunk of its brand to Mr. Carasso’s most popular line, “Take the sh out of IT”. This tag line single-handedly created lines 4 deep at trade shows for our black with white text “take the sh out of IT” teeshirts.

But that was not his only tag line - he is rather prolific.
David posted a few on his blog.

There are tons more that are NSFW and we should find some place “appropriate” to post those - they are among the funniest.

Here are a few he did not list on his blog, which I found by scanning my inbox:

Be war eof logs

ROFLMAO - rolling our files, logging my ass off

The power of egrep, the grace of relational databases

All this from Splunk’s brilliant but rather quirky Chief Mind.

Thanks David!

BTW, if you have a tag line - post it here.
Maybe we will do something more interesting with them all.

Cheers,
e

rapid coalescence phase of software project lifecycle

This is not the first time it’s happened, and I don’t really keep track, but it seems more common than not with us.

Several weeks before we try and ship something substantial, we enter what most people would traditionally call “code freeze”. For us that means something more like “no new features unless they are really important - freeze” - and we try hard to shore up all the loose ends. When asked, the exceedingly smart folks who build and are responsible for Splunk all say we are close and it’s just a bit longer - but…. as someone looking at the final and integrated product, NOTHING WORKS - not even close.

There are periods of doubt - that feeling that there is no way we can bring this around and that we are months behind schedule. But again, the kick ass engineers at Splunk keep saying we are close and to just hang in there. There is something about looking at the problem from the bottom up / inside out. When you’re writing the code you see the momentum and the gaps. When you’re outside, you just see coredumps and crash logs ;-)