<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>ewoo</title>
	<atom:link href="http://blogs.splunk.com/ewoo/feed/" rel="self" type="application/rss+xml" />
	<link>http://blogs.splunk.com/ewoo</link>
	<description>Just another WordPress weblog</description>
	<pubDate>Wed, 30 Jan 2008 22:14:19 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
	<language>en</language>
			<item>
		<title>Your most important IT data: funny quotes</title>
		<link>http://blogs.splunk.com/ewoo/2008/01/30/your-most-important-it-data-funny-quotes/</link>
		<comments>http://blogs.splunk.com/ewoo/2008/01/30/your-most-important-it-data-funny-quotes/#comments</comments>
		<pubDate>Wed, 30 Jan 2008 22:14:19 +0000</pubDate>
		<dc:creator>ewoo</dc:creator>
		
		<category><![CDATA[dev]]></category>

		<category><![CDATA[hacks]]></category>

		<category><![CDATA[bash]]></category>

		<category><![CDATA[bash.org]]></category>

		<guid isPermaLink="false">http://blogs.splunk.com/ewoo/2008/01/30/your-most-important-it-data-funny-quotes/</guid>
		<description><![CDATA[bash.org is a natural dataset for splunking. It&#8217;s a huge blob of loosely structured text data, and it&#8217;s made of win.
To play with a live instance, go to bash.splunklabs.com, login: guest, password: guest.
Of course, Splunk duplicates the functionality of the site itself. We can find, for example, the top 100 IRC quotes:

Splunk lets us do [...]]]></description>
			<content:encoded><![CDATA[<p>bash.org is a natural dataset for splunking. It&#8217;s a huge blob of loosely structured text data, and it&#8217;s made of win.</p>
<p>To play with a live instance, go to <a href="http://bash.splunklabs.com">bash.splunklabs.com</a>, login: guest, password: guest.</p>
<p>Of course, Splunk duplicates the functionality of the site itself. We can find, for example, the top 100 IRC quotes:</p>
<p><a href="http://blogs.splunk.com/devuploads/2008/01/top_irc.png"><img src="http://blogs.splunk.com/devuploads/2008/01/top_ircpng.jpg" /></a></p>
<p>Splunk lets us do considerably more, though. What are the top one-liners?</p>
<p><a href="http://blogs.splunk.com/devuploads/2008/01/top_one_liners.png"><img src="http://blogs.splunk.com/devuploads/2008/01/top_one_linerspng.jpg" /></a></p>
<p>How many more quotes mention &#8220;girlfriend&#8221; than &#8220;boyfriend&#8221;, i.e. exactly how bad is this sausage party?</p>
<p><a href="http://blogs.splunk.com/devuploads/2008/01/gf_vs_bf.png"><img src="http://blogs.splunk.com/devuploads/2008/01/gf_vs_bfpng.jpg" /></a></p>
<p>Are there any commonly quoted individuals?</p>
<p><a href="http://blogs.splunk.com/devuploads/2008/01/nicks.png"><img src="http://blogs.splunk.com/devuploads/2008/01/nickspng.jpg" /></a></p>
<p>Are there any interesting trends in quote scores over time? Take a look at high quote scores vs. quote ID:</p>
<p><a href="http://blogs.splunk.com/devuploads/2008/01/max_score_vs_id.png"><img src="http://blogs.splunk.com/devuploads/2008/01/max_score_vs_idpng.jpg" /></a></p>
<p>It seems likely that older quotes, especially good ones, benefit from a disproportionately greater number of views (the rich getting richer, so to speak); this might explain why the peaks in the low-quote-ID ranges are higher than the peaks for more recent quotes. Or maybe the internet just doesn&#8217;t produce the same quality of LOLs that it once did.</p>
<p>To try this yourself, add the following to props.conf:</p>
<p><code>[sourcetype::bash]<br />
BREAK_ONLY_BEFORE = (#[0-9]* \+)|([0-9]+-[0-9]+-[0-9]+-[0-9]+-[0-9]+-[0-9]+)<br />
REPORT-bash = bash</code></p>
<p>and the following to transforms.conf:</p>
<p><code>[bash]<br />
REGEX = #([0-9]+) \+\((-?[0-9]+)\)- \[X\]<br />
FORMAT = $0 bash_quote_id::$1 bash_quote_score::$2</code></p>
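<p>If you want to sanity-check the two stanzas before pointing Splunk at the data, here is a rough Python sketch that reuses the same regexes. It only approximates what Splunk does (start a new event at each matching line, then pull out the two fields); the function names and the sample text are made up for illustration, and Splunk itself keeps field values as strings rather than integers.</p>

```python
import re

# Patterns copied from the props.conf and transforms.conf stanzas above.
BREAK_ONLY_BEFORE = re.compile(
    r"(#[0-9]* \+)|([0-9]+-[0-9]+-[0-9]+-[0-9]+-[0-9]+-[0-9]+)"
)
FIELDS = re.compile(r"#([0-9]+) \+\((-?[0-9]+)\)- \[X\]")


def split_events(text):
    """Start a new event at every line that matches BREAK_ONLY_BEFORE."""
    events = []
    for line in text.splitlines():
        if BREAK_ONLY_BEFORE.search(line) or not events:
            events.append([line])
        else:
            events[-1].append(line)
    return ["\n".join(lines) for lines in events]


def extract_fields(event):
    """Return the fields named in FORMAT, or an empty dict if REGEX misses."""
    match = FIELDS.search(event)
    if match is None:
        return {}
    # Converted to int here for convenience; this is a choice of the sketch.
    return {
        "bash_quote_id": int(match.group(1)),
        "bash_quote_score": int(match.group(2)),
    }


# Made-up sample in the shape the regexes expect; not real bash.org data.
SAMPLE = """#1001 +(250)- [X]
alice: first line of a quote
bob: second line
#1002 +(-3)- [X]
carol: a one-liner"""

events = split_events(SAMPLE)
fields = [extract_fields(event) for event in events]
print(fields)
```

<p>Each quote header opens a new event, so the sample splits into two events, and the negative score shows why the score group allows an optional minus sign.</p>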
<p>Then, get a static copy of bash.org. You can grab the one I&#8217;ve created <a href="http://blogs.splunk.com/devuploads/2008/01/bashtxt.zip">here</a>, or you can generate it yourself:</p>
<p><code>$ curl -o '#1.html' 'http://bash.org/?browse&amp;p=[001-409]'<br />
$ for cur in *.html ; do lynx -dump -nonumbers "./$cur" >> /tmp/bash.txt ; done</code></p>
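<p>The curl line leans on URL globbing: the range [001-409] expands to zero-padded page numbers, and #1 in the output name is replaced by the current value of that range. A small Python sketch of the expansion, with hypothetical helper names and no network access, shows which files to expect:</p>

```python
def expand_range(start, stop, width):
    """Yield zero-padded page numbers, as curl does for a range like [001-409]."""
    for page in range(start, stop + 1):
        yield str(page).zfill(width)


def planned_downloads(first="001", last="409"):
    """Pair each URL with the file that the -o '#1.html' template would name."""
    width = len(first)
    pairs = []
    for page in expand_range(int(first), int(last), width):
        url = "http://bash.org/?browse&p=" + page
        pairs.append((url, page + ".html"))
    return pairs


pairs = planned_downloads()
print(len(pairs), pairs[0], pairs[-1])
```

<p>The page count of 409 comes straight from the curl command above and reflects the site at the time of writing, so adjust the upper bound if you regenerate the copy.</p>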
<p>Finally, push the data into Splunk:</p>
<p><code>$ splunk add tail -source /tmp/bash.txt -sourcetype bash</code></p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.splunk.com/ewoo/2008/01/30/your-most-important-it-data-funny-quotes/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
