Splunking Kafka At Scale
At Splunk, we love data and we’re not picky about how you get it to us. We’re all about being open, flexible and scaling to meet your needs. We realize that not everybody has the need or desire to install the Universal Forwarder to send data to Splunk. That’s why we created the HTTP Event Collector. This has opened the door to getting a cornucopia of new data sources into Splunk, reliably and at scale.
We’re seeing more customers in Major Accounts looking to integrate their Pub/Sub message brokers with Splunk. Kafka is the most popular message broker that we’re seeing out there but Google Cloud Pub/Sub is starting to make some noise. I’ve been asked multiple times for guidance on the best way to consume data from Kafka.
In the past I’ve just directed people to our officially supported technology add-on for Kafka on Splunkbase. It works well for simple Kafka instances, but if you have a large Kafka cluster comprised of high throughput topics with tens to hundreds of partitions, it has its limitations. The first is that management is cumbersome. It has multiple configuration topologies and requires multiple collection nodes to facilitate data collection for the given topics. The second is that each data collection node is a simple consumer (single process) with no ability to auto-balance across the other ingest nodes. If you point it to a topic it will take ownership of all partitions on the topic and consumes via round-robin across the partitions. If your busy topic has many partitions, this won’t scale well and you’ll lag reading the data. You can scale by creating a dedicated input for each partition in the topic and manually assigning ownership of a partition number to each input, but that’s not ideal and creates a burden in configuration overhead. The other issue is that if any worker process dies, the data won’t get read for its assigned partition until it starts back up. Lastly, it requires a full Splunk instance or Splunk Heavy Forwarder to collect the data and forward it to your indexers.
Due to the limitations stated above, a handful of customers have created their own integrations. Unfortunately, nobody has shared what they’ve built or what drivers they’re using. I’ve created an integration in Python using PyKafka, Requests and the Splunk HTTP Event Collector. I wanted to share the code so anybody can use it as a starting point for their Kafka integrations with Splunk. Use it as is or fork it and modify it to suit your needs.
Why should you consider using this integration over the Splunk TA? The first is scalability and availability. The code uses a PyKafka balanced consumer. The balanced consumer coordinates state for several consumers who share a single topic by talking to the Kafka broker and directly to Zookeeper. It registers a consumer group id that is associated with several consumer processes to balance consumption across the topic. If any consumer dies, a rebalance across the remaining available consumers will take place which guarantees you will always consume 100% of your pipeline given available consumers. This allows you to scale, giving you parallelism and high availability in consumption. The code also takes advantage of multiple CPU cores using Python multiprocessing. You can spawn as many consumers as available cores to distribute the workload efficiently. If a single collection node doesn’t keep up with your topic, you can scale horizontally by adding more collection nodes and assigning them to the same consumer group id.
The second reason you should consider using it is the simplified configuration. The code uses a YAML config file that is very well documented and easy to understand. Once you have a base config for your topic, you can lay it over all the collection nodes using your favorite configuration management tool (Chef, Puppet, Ansible, et al.) and modify the number of workers according to the number of cores you want to allocate to data collection (or set to auto to use all available cores).
The other piece you’ll need is a highly available HTTP Event Collector tier to receive the data and forward it on to your Splunk indexers. I’d recommend scenario 3 outlined in the distributed deployment guide for the HEC. It’s comprised of a load balancer and a tier of N HTTP Event Collector instances which are managed by the deployment server.
Once you’ve got your HEC tier configured, inputs created and your Kafka pipeline flowing with data you’re all set. Just fire up as many instances as necessary for the topics you want to Splunk and you’re off to the races! Feel free to contribute to the code or raise issues and make feature requests on the Github page.