<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>rob</title>
	<atom:link href="http://blogs.splunk.com/rob/feed/" rel="self" type="application/rss+xml" />
	<link>http://blogs.splunk.com/rob</link>
	<description>Just another WordPress weblog</description>
	<pubDate>Wed, 06 Feb 2008 02:33:36 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
	<language>en</language>
			<item>
		<title>The SSL Performance Odyssey</title>
		<link>http://blogs.splunk.com/rob/2008/02/04/the-ssl-performance-odyssey/</link>
		<comments>http://blogs.splunk.com/rob/2008/02/04/the-ssl-performance-odyssey/#comments</comments>
		<pubDate>Mon, 04 Feb 2008 20:14:37 +0000</pubDate>
		<dc:creator>rob</dc:creator>
		
		<category><![CDATA[dev]]></category>

		<category><![CDATA[tech]]></category>

		<category><![CDATA[Python Bugs Performance]]></category>

		<guid isPermaLink="false">http://blogs.splunk.com/rob/2008/02/04/the-ssl-performance-odyssey/</guid>
		<description><![CDATA[When you come to dev.splunk.com, you see pictures of beer pong, full bars, stuffed ponies with fart machines taped to their ass, etc - basically engineers gone wild.  Somewhere between all of this insaneness, we actually find the time to write code and solve problems like this one. This post is all about a crazy-weird [...]]]></description>
			<content:encoded><![CDATA[<h3>When you come to dev.splunk.com, you see pictures of beer pong, full bars, stuffed ponies with fart machines taped to their ass, etc - basically engineers gone wild.  Somewhere between all of this insaneness, we actually find the time to write code and solve problems like this one. This post is all about a crazy-weird performance issue that we were experiencing, how it manifested itself and ultimately how it was fixed.</h3>
<p>I suspect others may be having this problem, as the problem lives in some <em><strong>very</strong></em> popular open source code as far as I can tell.   With that, I&#8217;ll begin telling you about my journey into hell.</p>
<p>Splunk has a home-grown embedded HTTP(S) server that serves up all external interfaces to the &#8216;splunkd&#8217; daemon.   We use it as the core engine for our REST and XML/RPC-like APIs.  The GUI and the CLI both end up talking to the daemon via this server.</p>
<p>When I wrote the core of it a few months ago, I ran some rudimentary performance tests on several platforms and it seemed decent enough for our use, but a week ago, the manager of the Search and Indexing team (Stephen) said that he was seeing <em>abysmal </em>performance using SSL.  He said that the GUI performance was being impacted.  I didn&#8217;t believe him and insisted that it was something else and that he was high.</p>
<p>So to prove to him that it wasn&#8217;t <em>my </em>server, or <em>my </em>problem like all engineers do, I gave him a small python script that hits the server in a tight loop and we checked the performance.  It sucked.  Continuing with the theme of &#8220;this isn&#8217;t my problem&#8221; - I told him it was probably the handler of the request that was doing something that made the server seem slow.  This is when he laughed at me and said &#8220;watch this&#8221;:  He proceeds to turn off SSL, re-run the same test and the performance of the server goes up by approximately 50X.  <strong>50 times faster!</strong>    I know that SSL is slower than non-encrypted streams, but there was no way this was the problem.  Whoa!  We can&#8217;t ship this way.  This needs to be fixed!</p>
<p>In fact, a very small HTTP request (approx. 80 bytes) with a small reply (approx. 300 bytes) was operating at only 23 requests/sec!  When he turned off SSL, he was getting over 1000 req/sec!  What???</p>
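<p>The test script itself is not shown in this post. As a rough idea of what such a tight loop looks like, here is a minimal sketch in modern Python (not the Python 2.5 used at the time). The names <em>measure_rate</em> and <em>make_https_sender</em> are made up for illustration, opening one connection per request is an assumption, and the disabled certificate checks assume a test server with a self-signed certificate.</p>

```python
import time


def measure_rate(send_request, count=100):
    """Call send_request() count times and return requests per second."""
    start = time.perf_counter()
    for _ in range(count):
        send_request()
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse clocks.
    return count / max(elapsed, 1e-9)


def make_https_sender(host, port, path="/"):
    """Hypothetical helper: one small GET per call over a fresh HTTPS connection."""
    import http.client
    import ssl

    # Certificate checks are disabled only because this sketch assumes a
    # test server with a self-signed certificate.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE

    def send():
        conn = http.client.HTTPSConnection(host, port, context=context)
        conn.request("GET", path)
        conn.getresponse().read()
        conn.close()

    return send
```

<p>Calling measure_rate(make_https_sender("localhost", 8443)) would time 100 requests against a server on that host and port; both values are placeholders.</p>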
<p>So, of course I tried the same test on my OSX laptop and I got 130+ req/sec - within the realm of reasonable and certainly better than 23.  I then tried running the server on my laptop and the client on my Linux Fedora machine resulting in basically the same performance.   Why does this work on my hardware and not his?</p>
<p>Finally, I switched the server and client by putting the server on my Linux box and the client on my Mac.  I re-ran the test and damned if the performance didn&#8217;t completely suck!  I was getting 20 or so request-replies per second over SSL.</p>
<p><strong>  But, why does the OS matter?  I didn&#8217;t get it.</strong></p>
<h3>My SSL Performance Bug Diary</h3>
<ul>
<li>Broke out ssldump.  Here is a snippet from an OSX client and a Linux server.  Note the third C-&gt;S line of .0398 seconds.  This is the cause of the slowdown, but why?</li>
</ul>
<p><a href="http://blogs.splunk.com/devuploads/2008/02/ssldumpold.jpg" title="SSL Dump Slow"><img src="http://blogs.splunk.com/devuploads/2008/02/ssldumpold.jpg" alt="SSL Dump Slow" /></a></p>
<ul>
<li>Spent 2 hours looking over every possible OpenSSL build option and tried turning various ones on and off.  No difference.  (score: Bug 1, Rob 0)</li>
<li>Spent many hours trying different crypto combinations.  Little difference beyond the obvious and documented performance differences.  (score: Bug 2, Rob 0)</li>
<li>Perhaps I need to throw in server-side SSL caching.  I throw it in, with the assumption that the python client implements client-side SSL caching.  No performance change.  (score:  Bug 3, Rob 0)</li>
<li>Thinking it might be the Nagle algorithm, I modify my test to send larger requests and guess what?  The performance is normal again!   I try to find out exactly when it turns from slow to fast (as far as the request size) by trying request sizes of 1, 2, 4, 8, 16, 32&#8230;&#8230;..16K bytes.  Wow, just around 1300-1400 bytes is where the performance goes from sucks to fast.  Look at the graph below.  See the spike?  Hmmm&#8230;.. (score: Bug 3, Rob 1)</li>
</ul>
<p><a href="http://blogs.splunk.com/devuploads/2008/02/mtuspike1.jpg" title="mtuspike1.jpg"><img src="http://blogs.splunk.com/devuploads/2008/02/mtuspike1.jpg" alt="mtuspike1.jpg" /></a></p>
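<p>The request-size sweep from the diary can be sketched as below. This is only an illustration: <em>request_sizes</em> and <em>find_cliff</em> are hypothetical names, and the factor used to decide what counts as a jump is an arbitrary choice.</p>

```python
def request_sizes(max_size=16 * 1024):
    """Return 1, 2, 4, ... up to and including max_size (powers of two)."""
    sizes = []
    size = 1
    while max_size >= size:
        sizes.append(size)
        size *= 2
    return sizes


def find_cliff(rates_by_size, factor=5.0):
    """Return the first size whose rate is at least `factor` times the previous one.

    rates_by_size is a list of (size, requests_per_second) pairs sorted by size.
    Returns None when no such jump exists.
    """
    for (_, prev_rate), (size, rate) in zip(rates_by_size, rates_by_size[1:]):
        if prev_rate > 0 and rate >= factor * prev_rate:
            return size
    return None
```

<p>Powers of two only bracket the jump between two sizes. Pinning it down to the 1300-1400 byte range takes a second, finer sweep between those two sizes.</p>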
<ul>
<li>I change the MTU on the server from the default of 1500 bytes to 1000 bytes. The performance cliff now is lowered to somewhere in the 800-900 byte range. The MTU is the key! (score: Bug 3, Rob 2)</li>
<li>It&#8217;s got to be the Nagle algorithm.  I try turning off the Nagle algorithm on the server.  No performance change.  (score: Bug 4, Rob 2)</li>
<li>I give the problem to our performance engineer.  He can reproduce it.  I suck.</li>
<li>Decide to try ssldump again and this time try a different test - curl sending the same size request as in the python test.  I want to compare timings.  BINGO.  It&#8217;s not the server, it&#8217;s a combination of the server running on Linux and <em><strong>Python.</strong></em> (score: Bug 5, Rob 3).  Notice in the following curl ssldump image, the single C-&gt;S line and the fast .0007 second timing.  Contrast this to the previous ssldump image and herein lies the problem:</li>
</ul>
<p><a href="http://blogs.splunk.com/devuploads/2008/02/ssldumpcurl.jpg" title="ssldump curl"><img src="http://blogs.splunk.com/devuploads/2008/02/ssldumpcurl.jpg" alt="ssldump curl" /></a></p>
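<p>The diary does not show how Nagle was turned off on the server. On a TCP socket the usual switch is the TCP_NODELAY option, whatever language the program is written in. A minimal Python sketch follows, with the hypothetical helper names <em>disable_nagle</em> and <em>nagle_disabled</em>. The option only affects the socket it is set on, so setting it on the server says nothing about how the client batches its own writes.</p>

```python
import socket


def disable_nagle(sock):
    """Turn off the Nagle algorithm on a TCP socket by setting TCP_NODELAY."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)


def nagle_disabled(sock):
    """Return True when TCP_NODELAY is set on the socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```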
<ul>
<li>Now to fix it.  It really really seems like Python is the problem.  I try it with urllib2.  Same thing.</li>
<li>I try it with httplib2.  Same thing.</li>
<li>I look at the code for urllib2 and httplib2 and guess what?  They both use httplib.  The problem must be in httplib.  I dig into the code and start commenting shit out and looking at the resulting ssldump output to figure out *exactly* which write is causing the damage.  I find the bug. (score:  Rob wins)</li>
</ul>
<h3>The Problem and the Fix</h3>
<p>I forgot to tell you that we are using Python 2.5.  It turns out that <em>httplib.py</em> sends requests over the wire in 2 chunks.  The first chunk consists of the HTTP headers.  The second chunk is the body.  The fix I made appends the body to the headers and sends the request in 1 chunk only.  This is what curl does and this fixes the performance problems.</p>
<p>Here is the fix for download:</p>
<p><a href="http://blogs.splunk.com/devuploads/2008/02/httplib.py" title="httplib.py">httplib.py</a></p>
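<p>The downloadable file holds the real change. The difference between the two behaviors can be illustrated without touching httplib at all. The sketch below is not the patch: it uses a fake socket (a hypothetical <em>RecordingSocket</em>) that records each write, and a made-up <em>build_headers</em> helper, to show one write versus two for the same bytes.</p>

```python
class RecordingSocket:
    """Stand-in for a socket that records each write instead of sending it."""

    def __init__(self):
        self.writes = []

    def sendall(self, data):
        self.writes.append(bytes(data))


def build_headers(method, path, host, body):
    """Build a minimal HTTP/1.1 request head, ending with the blank line."""
    lines = [
        f"{method} {path} HTTP/1.1",
        f"Host: {host}",
        f"Content-Length: {len(body)}",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def send_in_two_chunks(sock, method, path, host, body):
    """The pattern the post describes in httplib: headers first, then body."""
    sock.sendall(build_headers(method, path, host, body))
    sock.sendall(body)


def send_in_one_chunk(sock, method, path, host, body):
    """The pattern the fix describes: body appended to headers, one write."""
    sock.sendall(build_headers(method, path, host, body) + body)
```

<p>Both functions put identical bytes on the wire. Only the number of writes differs, which matches the single C-&gt;S line in the curl ssldump output.</p>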
<p>Here is my final data:</p>
<p><a href="http://blogs.splunk.com/devuploads/2008/02/fullgraph.jpg" title="fullgraph.jpg"><img src="http://blogs.splunk.com/devuploads/2008/02/fullgraph.jpg" alt="fullgraph.jpg" /></a></p>
<h3>Things I Still don&#8217;t Understand</h3>
<p>Because it seems to work and this took so damn long, I am not going to do any further investigations, but there are still many unsolved mysteries. Perhaps one of you can figure them out.</p>
<ul>
<li>Why the extreme falloff on Linux where both the client and server are on the same machine at 16K request/reply size?</li>
<li>Why is OSX so much slower than Linux?</li>
<li>Why does the new code speed up Linux only?</li>
<li>Notice that only the OSX box gets the speed up at the MTU; the Linux box continues the slow performance regardless of the MTU.</li>
</ul>
<h3>Windows to Linux Performance Numbers (added 2/5/08)</h3>
<p>So I added a Windows to Linux graph based on the first comment I received below.  Yes, we do test with Windows, and yes, it is not out yet (but will be soon).  The problem manifests itself exactly like it does on other platforms.  Notice the difference:</p>
<p><a href="http://blogs.splunk.com/devuploads/2008/02/windows-linux.jpg" title="windows-linux.jpg"><img src="http://blogs.splunk.com/devuploads/2008/02/windows-linux.jpg" alt="windows-linux.jpg" /></a></p>
<h3>Specs on the Test Hardware</h3>
<ul>
<li>Windows
<ul>
<li>Dual Core, very fast, lots of RAM (will provide detailed specs in a bit)</li>
</ul>
</li>
<li>Linux
<ul>
<li>2.6.11-1.1369_FC4smp</li>
<li>3.4GHz P4, Hyperthreaded, 2 GB RAM</li>
</ul>
</li>
<li>OSX
<ul>
<li>Mac Pro Laptop</li>
<li>1.8GHz Pentium Core II duo (2 cores), 3 GB RAM</li>
</ul>
</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://blogs.splunk.com/rob/2008/02/04/the-ssl-performance-odyssey/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Diagraming Splunk’s data-flow (part 2 - performance overlays)</title>
		<link>http://blogs.splunk.com/rob/2007/10/11/diagraming-splunk%e2%80%99s-data-flow-part-2-performance-overlays/</link>
		<comments>http://blogs.splunk.com/rob/2007/10/11/diagraming-splunk%e2%80%99s-data-flow-part-2-performance-overlays/#comments</comments>
		<pubDate>Fri, 12 Oct 2007 00:49:01 +0000</pubDate>
		<dc:creator>rob</dc:creator>
		
		<category><![CDATA[Homepage]]></category>

		<category><![CDATA[dev]]></category>

		<category><![CDATA[hacks]]></category>

		<guid isPermaLink="false">http://blogs.splunk.com/rob/2007/10/11/diagraming-splunk%e2%80%99s-data-flow-part-2-performance-overlays/</guid>
		<description><![CDATA[In my previous post &#8220;Diagraming Splunk&#8217;s data-flow&#8221; I wrote a small python script that parsed Splunk&#8217;s runtime environment ($SPLUNK_HOME/var/run/splunk/composite.xml) and generated a file which when input into graphviz would generate a nice architectural diagram of how pipelines and processors are wired together.
In this installment, I took it to the next level by using Splunk&#8217;s search [...]]]></description>
			<content:encoded><![CDATA[<p>In my previous post &#8220;Diagraming Splunk&#8217;s data-flow&#8221; I wrote a small python script that parsed Splunk&#8217;s runtime environment ($SPLUNK_HOME/var/run/splunk/composite.xml) and generated a file which when input into <a href="http://www.graphviz.org/">graphviz</a> would generate a nice architectural diagram of how pipelines and processors are wired together.</p>
<p>In this installment, I took it to the next level by using Splunk&#8217;s search capability to overlay performance metrics on the diagram.  The combination of Splunk logging metrics information for each processor within each pipeline (thanks Brad) and the ability to have Splunk execute a <em>search processor </em>written in Python made this possible.  Here is how you use it:</p>
<p>First download <a href="http://www.graphviz.org/">graphviz</a>.  I particularly like the OSX application that they&#8217;ve written because you can see the graph on the screen and as the file changes, those changes are reflected in the graph you are viewing.  If you don&#8217;t have a Mac, use the command line version to generate different types of output file formats like .jpeg, etc.</p>
<p>Go to <a href="http://www.splunkbase.com/addons/Search_Commands/Splunk/Performance_tuning/addon:Perfgraph">SplunkBase</a> to download my python script.  Copy the .py file into $SPLUNK_HOME/etc/searchscripts.</p>
<p>Start Splunk.</p>
<p>Type the following into the search box:<img alt="index___internal metrics pipeline processor NOT get - over all time - localhost - Splunk 3.2-UNSTABLE-4.jpg" src="http://blogs.splunk.com/devuploads/2007/10/index___internal%20metrics%20pipeline%20processor%20NOT%20get%20-%20over%20all%20time%20-%20localhost%20-%20Splunk%203.2-UNSTABLE-4.jpg" /><br />
This will search for the appropriate metrics information and pipe the results through the script.</p>
<p>There are 2 options to perfgraph:</p>
<p><em>perfgraph [output filename] [cpu, execs, cumhits]</em></p>
<p>Unfortunately (because I&#8217;m lazy) you can&#8217;t specify cpu, execs or cumhits without also specifying an output file.  The first parameter is the full path and file name of the &#8216;dot&#8217; file you wish to create.  It defaults to /tmp/out.dot.</p>
<p>The second parameter, if specified, tells the script to highlight in red the slowest processor (cpu), the processor with the most hits (execs) or the processor with the most cumulative hits (cumhits).  This parameter defaults to &#8216;none&#8217;, or no highlighting.</p>
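<p>The positional rules above can be sketched as follows. This is not the code from the SplunkBase script: <em>parse_perfgraph_args</em> is a hypothetical name, and rejecting unknown modes with an error is a choice made for this sketch. Only the defaults (/tmp/out.dot and &#8216;none&#8217;) and the three mode names come from the description.</p>

```python
DEFAULT_OUTPUT = "/tmp/out.dot"
HIGHLIGHT_MODES = ("none", "cpu", "execs", "cumhits")


def parse_perfgraph_args(args):
    """Return (output_path, highlight_mode) from the positional arguments."""
    output_path = args[0] if len(args) >= 1 else DEFAULT_OUTPUT
    highlight = args[1] if len(args) >= 2 else "none"
    if highlight not in HIGHLIGHT_MODES:
        raise ValueError(f"unknown highlight mode: {highlight!r}")
    return output_path, highlight
```

<p>Because both arguments are positional, a highlight mode can only appear in the second slot, which is why it cannot be given without an output file.</p>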
<p>The above search string results in the following graph (portion).  Notice the performance information overlaid into the processors:<br />
<img alt="out.dot-1.jpg" src="http://blogs.splunk.com/devuploads/2007/10/out.dot-1.jpg" /></p>
<p>If you specify the output file and &#8216;cpu&#8217;, the processor with the most cpu time will be highlighted.  Here is the search:</p>
<p><img alt="index___internal metrics pipeline processor NOT get | perfgraph _tmp_out.dot cpu - over all time - localhost - Splunk 3.2-UNSTABLE.jpg" src="http://blogs.splunk.com/devuploads/2007/10/index___internal%20metrics%20pipeline%20processor%20NOT%20get%20%7C%20perfgraph%20_tmp_out.dot%20cpu%20-%20over%20all%20time%20-%20localhost%20-%20Splunk%203.2-UNSTABLE.jpg" /></p>
<p>It results in the following graph (portion).  Notice the red processor:</p>
<p><img alt="out.dot-2.jpg" src="http://blogs.splunk.com/devuploads/2007/10/out.dot-2.jpg" /></p>
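<p>For a feel of how the red highlight might end up in the .dot file, here is a minimal sketch. It assumes the metrics have already been reduced to a mapping of processor name to CPU seconds; <em>metrics_to_dot</em> and the label layout are illustrative and not taken from the real script.</p>

```python
def metrics_to_dot(cpu_seconds, highlight=False):
    """Render {processor_name: cpu_seconds} as DOT text.

    When highlight is True, the processor with the most CPU time is filled red.
    """
    slowest = max(cpu_seconds, key=cpu_seconds.get) if cpu_seconds else None
    lines = ["digraph perf {", "  node [shape=box];"]
    for name in sorted(cpu_seconds):
        attrs = [f'label="{name}\\ncpu={cpu_seconds[name]:.3f}"']
        if highlight and name == slowest:
            attrs.append('style=filled')
            attrs.append('fillcolor=red')
        lines.append(f'  "{name}" [{", ".join(attrs)}];')
    lines.append("}")
    return "\n".join(lines)
```

<p>Writing the returned text to /tmp/out.dot and opening it in graphviz would show one box per processor, with the slowest one in red when highlighting is on.</p>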
<p>Next steps:</p>
<ul>
<li>Overlay queue metrics into the queue nodes</li>
<li>Overlay indexer throughputs into the indexer nodes</li>
</ul>
<p>You see.  Splunk provides endless fun.  Insane!  Enjoy.</p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.splunk.com/rob/2007/10/11/diagraming-splunk%e2%80%99s-data-flow-part-2-performance-overlays/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Diagraming Splunk&#8217;s data-flow</title>
		<link>http://blogs.splunk.com/rob/2007/10/10/diagraming-splunks-data-flow/</link>
		<comments>http://blogs.splunk.com/rob/2007/10/10/diagraming-splunks-data-flow/#comments</comments>
		<pubDate>Wed, 10 Oct 2007 16:57:59 +0000</pubDate>
		<dc:creator>rob</dc:creator>
		
		<category><![CDATA[Homepage]]></category>

		<category><![CDATA[dev]]></category>

		<category><![CDATA[hacks]]></category>

		<guid isPermaLink="false">http://blogs.splunk.com/rob/2007/10/10/diagraming-splunks-data-flow/</guid>
		<description><![CDATA[This blog entry is not about how the framework works.  It is about a semi-cool visualization that I created using python and graphviz. If you watched the video where I presented Splunk&#8217;s framework architecture from a high level you know what pipelines and processors are.  If you haven&#8217;t, here is a very quick [...]]]></description>
			<content:encoded><![CDATA[<p>This blog entry is not about how the framework works.  It is about a semi-cool visualization that I created using python and <a href="http://www.graphviz.org/">graphviz</a>. If you watched the video where I presented Splunk&#8217;s framework architecture from a high level you know what pipelines and processors are.  If you haven&#8217;t, here is a very quick overview.</p>
<ul>
<li>A <strong><em>pipeline</em></strong> is a thread of execution that lives within the splunkd process.  Each pipeline executes a series of <strong><em>processors</em></strong>, each of which operates on data.  The data is created when the first processor on the pipeline reads it from some input (like tailing a file, or receiving it on a network port).  Each processor then does something to the data.  Eventually, the data gets indexed and execution is returned to the first processor to get more data again.</li>
</ul>
<ul>
<li>Pipelines are connected via <strong><em>queues</em></strong>. A queue output processor (the last processor in a pipeline) puts data on to a queue and blocks if the queue is full.  A queue input processor (the first processor at the top of a pipeline) gets the data item from the bottom of the queue and sends it on down the pipeline. If there is no data, it blocks waiting for some to be put on the queue.</li>
</ul>
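<p>The overview above maps fairly directly onto threads and a bounded queue. Here is a toy sketch in Python, assuming nothing about how splunkd really implements it: the function names, the STOP sentinel and the two-pipeline layout are all made up for illustration.</p>

```python
import queue
import threading

STOP = object()  # sentinel that tells the downstream pipeline to finish


def input_pipeline(items, out_queue):
    """First pipeline: read items, then a queue output step hands them on."""
    for item in items:
        out_queue.put(item)  # blocks while the queue is full
    out_queue.put(STOP)


def indexing_pipeline(in_queue, processors, indexed):
    """Second pipeline: a queue input step feeds a chain of processors."""
    while True:
        item = in_queue.get()  # blocks while the queue is empty
        if item is STOP:
            break
        for process in processors:
            item = process(item)
        indexed.append(item)


def run(items, processors, queue_size=2):
    """Wire the two pipelines together with one bounded queue and run them."""
    q = queue.Queue(maxsize=queue_size)
    indexed = []
    threads = [
        threading.Thread(target=input_pipeline, args=(items, q)),
        threading.Thread(target=indexing_pipeline, args=(q, processors, indexed)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return indexed
```

<p>Each pipeline is one thread, each processor is a plain function, and the small queue size makes the blocking behavior on both ends easy to observe.</p>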
<p>Enough already.  Go watch the video.  So, I decided that I&#8217;m tired of drawing these diagrams and wrote some code to produce them for me.</p>
<p>I implemented some python code that took the <em>composite.xml </em>file, parsed it and produced a <em>.dot </em>file.  Composite.xml, for those of you who don&#8217;t know, is an amalgamation of all pipelines and processors in the system.  It represents the current (or last) runtime environment for Splunk.  It lives in $SPLUNK_HOME/var/run/splunk.</p>
<p>I then took the resultant .dot file and ran it through  <em><a href="http://www.graphviz.org/">graphviz</a>.</em> After lots of tweaking, here is what I came up with.  Click on the image to see a larger version which is actually readable.</p>
<p><strong>Results </strong>(click to enlarge)<br />
<a href="http://blogs.splunk.com/devuploads/2007/10/test.jpg"><img width="253" height="177" alt="Auto-generated pipeline graph" src="http://blogs.splunk.com/devuploads/2007/10/test.jpg" /></a></p>
<p><strong>Python Transformation Code</strong></p>
<p>Untar this.  It&#8217;s only a single python file, but this blogging software wouldn&#8217;t let me upload a .py file.</p>
<p><a href="http://blogs.splunk.com/devuploads/2007/10/viz.tar">viz.tar</a></p>
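<p>The tarball holds the actual script. What follows is only a sketch of the same idea, and the XML layout it expects is an assumption, not the real composite.xml format: &#8216;pipeline&#8217; elements with a &#8216;name&#8217; attribute, each containing &#8216;processor&#8217; children with a &#8216;name&#8217; attribute, in execution order. <em>pipelines_to_dot</em> is a hypothetical name.</p>

```python
import xml.etree.ElementTree as ET


def pipelines_to_dot(xml_text):
    """Turn pipeline/processor XML into DOT text, one chain per pipeline.

    Assumed layout (hypothetical, not taken from a real composite.xml):
    'pipeline' elements with a 'name' attribute, each holding 'processor'
    child elements with a 'name' attribute, in execution order.
    """
    root = ET.fromstring(xml_text)
    lines = ["digraph pipelines {", "  node [shape=box];"]
    for pipeline in root.iter("pipeline"):
        names = [p.get("name") for p in pipeline.findall("processor")]
        # Qualify node ids with the pipeline name so equal processor
        # names in different pipelines stay distinct.
        ids = [f'{pipeline.get("name")}.{n}' for n in names]
        for node_id, name in zip(ids, names):
            lines.append(f'  "{node_id}" [label="{name}"];')
        for src, dst in zip(ids, ids[1:]):
            lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)
```

<p>The sketch draws each pipeline as a chain of boxes and leaves out the queues that connect pipelines, since how those appear in the file is not described here.</p>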
<p><strong>Future Work</strong></p>
<ul>
<li>Annotate the graph with run time statistics like average per-processor timing, average queue size, max queue size, etc.  This would require looking at the logs.</li>
<li>Launching this from Splunk, firing off the python along with the metrics data pre-sifted à la Splunk.</li>
</ul>
<p>Got more ideas?  Please post them here.</p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.splunk.com/rob/2007/10/10/diagraming-splunks-data-flow/feed/</wfw:commentRss>
		</item>
		<item>
		<title>The framework team is hiring</title>
		<link>http://blogs.splunk.com/rob/2007/10/09/the-framework-team-is-hiring/</link>
		<comments>http://blogs.splunk.com/rob/2007/10/09/the-framework-team-is-hiring/#comments</comments>
		<pubDate>Tue, 09 Oct 2007 18:52:30 +0000</pubDate>
		<dc:creator>rob</dc:creator>
		
		<category><![CDATA[dev]]></category>

		<category><![CDATA[jobs]]></category>

		<guid isPermaLink="false">http://blogs.splunk.com/rob/2007/10/09/the-framework-team-is-hiring/</guid>
		<description><![CDATA[Splunk&#8217;s framework team is involved in many diverse projects. The &#8220;framework&#8221; itself is really a set of generic code that makes up the runtime environment of Splunk.  In addition, we also handle bringing data into the system, distributing this data across enterprise topologies, authentication, access controls, configuration management, distributed deployment, high availability, real-time streaming, [...]]]></description>
			<content:encoded><![CDATA[<p>Splunk&#8217;s framework team is involved in many diverse projects. The &#8220;framework&#8221; itself is really a set of generic code that makes up the runtime environment of Splunk.  In addition, we also handle bringing data into the system, distributing this data across enterprise topologies, authentication, access controls, configuration management, distributed deployment, high availability, real-time streaming, encryption and much much more.</p>
<p>Splunk is extending its reach into extremely large deployments involving thousands of machines and devices across multiple data centers.  The framework team is responsible for making Splunk excel in these challenging environments.  If this sounds interesting and you want to work with some extremely talented people, please drop me some email.</p>
<p><strong>Framework Architect / Senior Engineer</strong></p>
<p>We are looking for a highly motivated engineer who will be responsible for driving the design and implementation of Splunk&#8217;s network management, scalability, and distributed deployment technology.  The right candidate is fluent in C++, high performance networking and  concurrent / multi-threaded design.</p>
<p><strong>Qualifications</strong></p>
<ul>
<li>Minimum 5 years of relevant industry experience</li>
<li>Expert C++ knowledge, deep understanding of design patterns and experience building clean external APIs.</li>
<li>Significant experience with multi-threaded design and implementation</li>
<li>Has designed &amp; implemented high throughput server systems</li>
<li>Practical experience with network protocols and complex topologies</li>
<li>BS/MS Computer Science / Engineering</li>
<li>Excellent verbal and written communication skills</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://blogs.splunk.com/rob/2007/10/09/the-framework-team-is-hiring/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Software configuration - why does this wheel need re-invention?</title>
		<link>http://blogs.splunk.com/rob/2007/10/02/software-configuration-why-does-this-wheel-need-re-invention/</link>
		<comments>http://blogs.splunk.com/rob/2007/10/02/software-configuration-why-does-this-wheel-need-re-invention/#comments</comments>
		<pubDate>Wed, 03 Oct 2007 02:01:09 +0000</pubDate>
		<dc:creator>rob</dc:creator>
		
		<category><![CDATA[dev]]></category>

		<guid isPermaLink="false">http://blogs.splunk.com/rob/2007/10/02/software-configuration-why-does-this-wheel-need-re-invention/</guid>
		<description><![CDATA[I have worked on so many software projects that I can&#8217;t possibly enumerate them.  Most of my contribution to these projects has been on the server side of things.  Every one of these projects needed to be configured in some way, shape or form and I just realized that every one of them [...]]]></description>
			<content:encoded><![CDATA[<p>I have worked on so many software projects that I can&#8217;t possibly enumerate them.  Most of my contribution to these projects has been on the server side of things.  Every one of these projects needed to be configured in some way, shape or form and I just realized that every one of them had its own configuration subsystem that was implemented from scratch.  Many of these configurations could be managed via GUIs and/or CLIs, and others simply were &#8220;managed&#8221; via vi, or emacs.  They all share one thing in common however - they all suck in one way or another.  Why?  Because configuration subsystems are incredibly difficult to get right.</p>
<p>Building a configuration system on the surface seems boring.  If I went and showed the sales guys how cool my configuration system was they would roll their eyes back into their heads.   Put some rotating, flashing thing on the GUI and they think you&#8217;re the coolest, most creative developer around.  The fact is that a good configuration system makes a <em>huge </em>difference to a product.  In fact, it can make or break it in some cases.</p>
<p>Indulge me in allowing me to share a typical &#8220;configuration system lifecycle&#8221;.  Please tell me if this seems familiar to you.  I have personally gone through this many times.</p>
<ul>
<li>Version 1.0 - simple configuration language, usually XML.  Why?  Because you need to get something up and running quickly.  XML has tons of parsers, validators, etc.  Users of this early release need to edit the configuration files using a text editor.  They need to restart the system every time a change is made. The developer states that this is fine - the product is &#8220;not intended for use by people that can&#8217;t use an editor&#8221;. Fuck &#8217;em.</li>
<li>Version 1.5 - The next release has some really complex configuration.  However, it&#8217;s still only modifiable via a text editor.  Maybe flow control is introduced.  Changing a configuration in the wrong way causes very bad and very weird things to happen.  Customer Support gets lots of calls.  There is no way to tell what a customer changed and what the default configuration was supposed to be without comparing the two configuration files side by side.</li>
<li>Version 2.0 - We need an administration GUI so people can configure this without having to call support every single time!  So a GUI is added.  Every administered item is coded into the server and into the GUI because every configuration has different validation, different things to check, etc.  The customers are much happier.  Until graybeard decides he hates the GUI and insists on using emacs.  The GUI and emacs don&#8217;t get along very well.  Things break again.</li>
<li>Version 2.5 - The executives decide that we need a way for &#8220;the community&#8221; to build widgets that other people can use.  They need to package these widgets up in some way that they can be downloaded and added to the system without disturbing local and default configurations.  The engineers decide to use layering to separate these 3 things out.  But layering in XML is nasty and people will get confused.  So out with the XML to something &#8220;simpler&#8221;.  Boy did this open a can of worms.  All the different parts of the system need to be modified to handle the new configuration syntax.  We are just about ready to ship.  Boy is this code base different - &#8220;Oh SHIT! We forgot we need migration scripts!&#8221;.  So they are frantically built and hastily tested.  The product ships.  Customers complain.  Not only do the migration scripts hork periodically, but the configuration language is new to them.</li>
<li>Version 3.0 - The server engineers are adding lots of new features to support customer requirements.  Unfortunately, every new feature needs custom GUI and CLI work to handle the administration of that feature.  This is simply not sustainable, so it&#8217;s been decided to data drive the GUI and CLI from a specification file that describes the syntax, the interdependencies, etc for each configuration item/file.  Furthermore, the community is going gangbusters, but downloading new widgets requires a restart of the server.  So do all configuration changes.  Once again every part of the system is changed to handle this dynamic configuration.  Man is this hard - &#8220;What do I do with the data that is already in the queues when the queue is supposed to be shrunk in this re-configuration?&#8221; asks one of the brightest engineers.  Hmm.</li>
</ul>
<p>You get the idea.<br />
So here in a nutshell is a list of reasons why configuration systems are so difficult.  I&#8217;m sure you can add more:</p>
<ul>
<li>They are actually small languages.  I have seen XML, simple linear lists of attribute/value pairs, scripting languages with flow control, strange and weird languages like in sendmail, etc.</li>
<li>They need validation so they don&#8217;t break the system</li>
<li>If there is GUI or CLI access, they need to be dynamically updated</li>
<li>Consistency between updates is critical so that someone editing a config file using vi doesn&#8217;t collide with someone using the GUI.</li>
<li>They need to be migrated from version to version or need some kind of backward compatibility</li>
<li>They can be layered so local changes override system defaults</li>
<li>They need to be extensible so ultimately 3rd parties can develop configurations that are add-ons</li>
<li>They need solid documentation - ultimately self-generating.</li>
<li>They should be data driven such that every time someone invents something that needs new configuration, the GUI and/or CLI don&#8217;t need new code.</li>
<li>They need to support dynamic loading with no system restarting</li>
<li>They may need to support versioning in systems that are composed of modules, each of which may be independently revved.</li>
</ul>
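<p>Of the items above, layering is the easiest to show in a few lines. The sketch below is one possible approach using Python&#8217;s standard <em>collections.ChainMap</em>, not a description of any product&#8217;s configuration system. The three layer names (local, add-on, default) follow the lifecycle story, and <em>effective_config</em> and <em>changed_from_default</em> are hypothetical helpers.</p>

```python
from collections import ChainMap


def effective_config(local, addon, default):
    """Merge three layers; earlier arguments win over later ones."""
    return ChainMap(local, addon, default)


def changed_from_default(local, addon, default):
    """List the keys whose effective value differs from the shipped default."""
    merged = effective_config(local, addon, default)
    return sorted(k for k in merged if merged[k] != default.get(k))
```

<p>Keeping the layers as separate mappings, rather than copying everything into one, is what makes it cheap to answer the support question of what a customer changed relative to the defaults.</p>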
<p><strong>Conclusion</strong></p>
<p>Configuration systems are often overlooked, but can be the core of an entire system.  There is <em>no </em>substitute for a really good one.  It&#8217;s almost impossible to get it right the first time, but you must really think long and hard about where you want it to go and what you want it to become.</p>
<p>Yes.  I copped out.  I didn&#8217;t tell you how to do these things.  I didn&#8217;t tell you where you can look on SourceForge to find the ultimate configuration system so you don&#8217;t need to re-invent the wheel yet again.  That is because there is none - at least not that I know of.  I have some ideas on how to build a generic configuration system that if open-sourced could save engineers months of time, but that is the topic of a different post.</p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.splunk.com/rob/2007/10/02/software-configuration-why-does-this-wheel-need-re-invention/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
