Parallel Data Transfer in Shuttl 0.8.0
In the world of Big Data, parallelism and distribution are key aspects of any successful technology. Shuttl was designed from the ground up with both in mind, piggy-backing on Splunk’s architecture. Until now, moving data from Splunk to another data store was done in parallel; however, many administrative and restoration actions were cumbersome without one key capability: orchestration.
Shuttl 0.8.0 was released just two months after the prior 0.7.x release and includes such orchestration, resulting in three main new features:
- Parallel List
- Parallel Data Pull
- Parallel Data Flush
Before we move on to the new features: if you aren’t familiar with Shuttl, I’d recommend reading the prior blog posts on the topic (http://blogs.splunk.com/?s=shuttl).
Shuttl provides a way to move data for both archiving and sharing. It also supports several backend storage systems, including Hadoop HDFS, Amazon S3, Amazon Glacier, and plain NFS. This allows for flexible storage choices.
Each Shuttl node will independently move data based on Splunk’s frozen-script mechanism. For other actions, a REST API is available on each node, and until now, it was up to the user to perform an orchestrated action, usually via a script using curl.
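For example, a pre-0.8.0 orchestration script would loop over every indexer and hit its Shuttl REST endpoint with curl. A minimal Python sketch of the same idea follows; the host names, port, and endpoint path are illustrative placeholders, not Shuttl’s documented API:

```python
# Build the per-node REST calls a user would otherwise script by hand with curl.
# Port and path are hypothetical stand-ins for Shuttl's actual REST interface.

def list_endpoints(indexers, port=9090, path="/shuttl/rest/archiver/bucket/list"):
    """Return one REST URL per indexer node in the cluster."""
    return ["http://%s:%d%s" % (host, port, path) for host in indexers]

urls = list_endpoints(["idx1.example.com", "idx2.example.com"])
for url in urls:
    print(url)
```

In practice each URL would be fetched (e.g. with curl or an HTTP library), and the user was responsible for collating the per-node results — exactly the coordination work that 0.8.0 now does for you.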
This has changed in 0.8.0!
As it turns out, the Splunk Search Head is the perfect location from which to orchestrate actions across the cluster. In 0.8.x, when Shuttl is installed on the Search Head, it knows about all of the Search Head’s “Search Peers” (i.e., all the nodes of the Splunk cluster), and via the UI on the Search Head you can now orchestrate all the actions across the entire Splunk cluster!
Let’s go through each action one-by-one.
Parallel List
From the Search Head, you can list all the buckets that have been Shuttled from the Splunk cluster, regardless of how many nodes the cluster has. (Shuttl simply iterates over the list of known indexers and calls their respective REST APIs.) In addition, the listing can be constrained by time range, so you can select just the buckets you are interested in.
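The time-range constraint boils down to a simple overlap test on each bucket’s earliest/latest timestamps. A sketch, with a bucket representation assumed purely for illustration (Shuttl’s real bucket metadata differs):

```python
# Keep only buckets whose [earliest, latest] span overlaps the query window.
# The dict shape and field names here are illustrative assumptions.

def overlaps(bucket, query_earliest, query_latest):
    """True if the bucket's time span intersects the query window (epoch secs)."""
    return bucket["earliest"] <= query_latest and bucket["latest"] >= query_earliest

buckets = [
    {"name": "db_1", "earliest": 100, "latest": 200},
    {"name": "db_2", "earliest": 300, "latest": 400},
]
selected = [b["name"] for b in buckets if overlaps(b, 150, 250)]
print(selected)  # only db_1 overlaps the 150-250 window
```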
Parallel Data Pull
This is often known as a “Thaw” operation in Splunk. In Shuttl 0.8.x, you can orchestrate a parallel pull from the backend to each individual Splunk node’s “Thaw” location. (Placing index buckets in the Thaw area ensures the data is not aged out by the normally defined retention policy; it stays there until explicitly “Flushed.”)
This provides incredible power for pulling data back into the original Splunk instance, or to a cloned instance.
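Conceptually, the Search Head fans the pull request out to each peer and waits for all of them to finish. A sketch of that fan-out, using a stand-in function in place of the actual REST call (names and behavior here are assumptions, not Shuttl internals):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the REST call that asks one node to pull its buckets from the
# backend into its local Thaw directory; purely illustrative.
def thaw_on_node(host):
    return "%s: thawed" % host

peers = ["idx1", "idx2", "idx3"]
# Issue the request to every peer concurrently; map preserves input order.
with ThreadPoolExecutor(max_workers=len(peers)) as pool:
    results = list(pool.map(thaw_on_node, peers))
print(results)
```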
Parallel Data Flush
The final operation that is orchestrated in parallel is the “Flush” operation.
When you are done with the data that has been brought into the Thawed location, you can, again, list all the Thawed buckets, select the ones you no longer need, and orchestrate a parallel Flush operation across all Splunk nodes.
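Since thawed buckets live on particular nodes, a flush of a selected set amounts to grouping the chosen buckets by the node that holds them and issuing one flush request per node. A sketch, with field names assumed for illustration:

```python
from collections import defaultdict

# Group the buckets chosen for flushing by the node that holds them, so one
# flush request can be issued per node. Field names are illustrative.
def group_by_node(thawed_buckets, to_flush):
    per_node = defaultdict(list)
    for bucket in thawed_buckets:
        if bucket["name"] in to_flush:
            per_node[bucket["node"]].append(bucket["name"])
    return dict(per_node)

thawed = [
    {"node": "idx1", "name": "db_1"},
    {"node": "idx2", "name": "db_2"},
    {"node": "idx1", "name": "db_3"},
]
print(group_by_node(thawed, {"db_1", "db_3"}))
```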
Shuttl and Beyond
The above additions allow for convenient management of data across its life cycle. We’ve received more feature requests for Shuttl and are happy to hear even more. Keep them coming!