Splunkgit – Github just got Splunked! (Part 1/4)
Knowledge is power ~ Sir Francis Bacon
I believe this to be true. Since data is the knowledge of the digital age, Splunk is all one needs to have power. For most readers of this blog it's no news that Splunk is a very powerful piece of software, but I have only just discovered this. My name is Emre Berge Ergenekon and this is my first blog post. I'm a computer science student at the Royal Institute of Technology in Stockholm, Sweden, doing my master's thesis at Splunk. Together with Petter, my first assignment was to create a Splunk app. This turned out to be really fun, and thanks to Boris, who gave us a great idea, we were able to create something really cool.
We Splunked github!
Our goal was to visualize data that is retrievable through github, presented in a way that helps developers easily overview and analyze the status of a repository. However, we didn't want to tie the app to github, so it is also useful for non-github projects. We have therefore designed our Splunk app to keep the github code and the git repository code separate.
The scripts retrieve data from two sources: the github API and the git repository logs.
Getting data from github was easy with their v3 API. With python, we were able to fetch information about the issues, watchers and forks of a repository. You can simply get the list of watchers on a github repo with curl (don't forget to substitute user-name and repo-name):
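For example (user-name and repo-name below are placeholders to substitute with a real owner and repository):

```shell
# List the watchers of a repository via the github v3 API
curl https://api.github.com/repos/user-name/repo-name/watchers
```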
In a similar way you can get info about the forks:
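Again, user-name and repo-name are placeholders:

```shell
# List the forks of a repository via the github v3 API
curl https://api.github.com/repos/user-name/repo-name/forks
```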
Most of the publicly visible data is available through the API without authentication.
Some important aspects of the API
- HTTPS
All API access is over HTTPS, and requests are always made against the api.github.com domain.
- JSON
The data returned by the API is always in JSON format, and all data sent in requests has to be JSON as well.
- Rate Limit
You can only make a total of 5000 requests per hour (it's possible to get your application whitelisted for more). The limit and the remaining count are present in the response headers:
X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4711
- Pagination
If a request returns multiple items, the number of items returned is limited to 30 per request. This is only the default value, though: using the ?per_page parameter, it can be raised up to 100.
In the requests we make, this value is always set to 100. This way we minimize the number of requests needed to collect the information. As an example, say the repo Splunkgit has 230 watchers. With the default pagination value of 30, you need 8 requests to create a list of the watchers, but with per_page set to 100 you can retrieve the same information with only 3 requests. As you'd expect, this makes the scripts run faster and uses less of your hourly rate limit quota.
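The arithmetic from the example above, as a quick sketch:

```python
import math

def requests_needed(item_count, per_page):
    """Number of paginated API requests needed to list item_count items."""
    return math.ceil(item_count / per_page)

print(requests_needed(230, 30))   # default page size: 8 requests
print(requests_needed(230, 100))  # per_page=100: 3 requests
```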
To iterate over the pages, you can look at the Link header. As an example, the following header is present in a response that has multiple pages:
Link: <https://api.github.com/repos?page=3&per_page=100>; rel="next", <https://api.github.com/repos?page=50&per_page=100>; rel="last"
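Pulling the page URLs out of such a header can be sketched like this (hand-rolled parsing for illustration; a real client might use a library instead):

```python
def parse_link_header(link_header):
    """Parse a github Link header into a dict mapping rel -> URL."""
    links = {}
    for part in link_header.split(", "):
        url_part, rel_part = part.split("; ")
        url = url_part.strip("<>")          # <https://...> -> https://...
        rel = rel_part.split('"')[1]        # rel="next"    -> next
        links[rel] = url
    return links

header = ('<https://api.github.com/repos?page=3&per_page=100>; rel="next", '
          '<https://api.github.com/repos?page=50&per_page=100>; rel="last"')
print(parse_link_header(header)["next"])
```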
So far we have retrieved data about the watcher and fork counts. While this information is great for monitoring the popularity of a repository, what really makes this app useful as a tool is the github issue data. We poll the API for a list of open as well as closed issues. The splunked issue information is as follows:
- Issue number
The unique identifier of an issue.
- Issue State
Open or Closed.
- Comment count
The number of comments on the issue.
- Reporter
The github user name of the issue's reporter.
- Title
The title of the issue.
- Creation, update and close times
A timestamp for each of the three events, when available.
The above information is later used to create dashboards for a fast overview:
- Newest issues
- Latest updated issues
- Oldest unclosed issues
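The fields listed above map onto the v3 issues API's JSON roughly like this (a sketch with mocked values; the flattened field names here are illustrative, not necessarily the app's exact ones):

```python
# A trimmed-down issue as returned by the v3 issues API (mocked values)
issue = {
    "number": 42,
    "state": "open",
    "comments": 3,
    "user": {"login": "reporter-name"},
    "title": "Example issue title",
    "created_at": "2011-10-01T10:00:00Z",
    "updated_at": "2011-10-02T11:00:00Z",
    "closed_at": None,  # still open
}

# Flatten into the key=value style that splunk extracts easily
fields = {
    "issue_number": issue["number"],
    "state": issue["state"],
    "comment_count": issue["comments"],
    "reporter": issue["user"]["login"],
    "title": issue["title"],
    "created_at": issue["created_at"],
    "updated_at": issue["updated_at"],
    "closed_at": issue["closed_at"],
}
# Skip absent values such as closed_at for open issues
event = " ".join('%s="%s"' % (k, v) for k, v in fields.items() if v is not None)
print(event)
```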
Where the API fell short
Retrieving all the forks originating from a specified repo is actually a hard task that requires lots of requests. It means iterating over the forks at every level (forks of forks, and so on) and building a list of them. In the end, the size of this list equals the fork count you can see on github.
A repository with 200 forks needs 200 requests to build this list. Sequential execution of the requests would take forever, which is why we used a library called joblib. Joblib makes the requests in parallel using a specified number of jobs:
list_of_returned_values = Parallel(n_jobs=<Number of jobs>)(delayed(fetch_list_of_forks)(list_of_repos[i]) for i in range(len(list_of_repos)))

This is the parallel equivalent of:

list_of_returned_values[0] = fetch_list_of_forks(list_of_repos[0])
list_of_returned_values[1] = fetch_list_of_forks(list_of_repos[1])
. . .
list_of_returned_values[i] = fetch_list_of_forks(list_of_repos[i])
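The all-levels traversal itself can be sketched as a breadth-first walk. Here fetch_list_of_forks is mocked with a dict instead of real API calls, so the sketch is self-contained:

```python
from collections import deque

# Mocked fork tree: repo -> list of its direct forks
FORKS = {
    "origin/repo": ["alice/repo", "bob/repo"],
    "alice/repo": ["carol/repo"],
    "bob/repo": [],
    "carol/repo": [],
}

def fetch_list_of_forks(repo):
    """Stand-in for the real API call returning a repo's direct forks."""
    return FORKS.get(repo, [])

def all_forks(root):
    """Collect forks at every level: forks, forks of forks, and so on."""
    found = []
    queue = deque([root])
    while queue:
        for fork in fetch_list_of_forks(queue.popleft()):
            found.append(fork)
            queue.append(fork)
    return found

print(all_forks("origin/repo"))
```

In the end, len(all_forks(...)) matches the total fork count github displays for the root repository.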
The git log command is very powerful. Without any need for parsing, you can easily retrieve splunkable information about commits. To format the output of each commit on a single line in the key=value pattern, we used the --pretty=format argument of git log.
The format we used was:
--pretty=format:'[%ci] author_name="%an" author_mail="%ae" commit_hash="%H" parent_hash="%P" tree_hash="%T"'
As you can see, it's very simple to use. There are also many more variables available than shown here.
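To see the format in action, you can run it against a throwaway repository (the paths and identity values below are placeholders):

```shell
set -e
# Build a throwaway repository so the command has something to log
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.name "Jane Doe"
git config user.email "jane@example.com"
echo hello > file.txt
git add file.txt
git commit -qm "initial commit"

# One line per commit, in the key=value pattern splunk likes
git log --pretty=format:'[%ci] author_name="%an" author_mail="%ae" commit_hash="%H" parent_hash="%P" tree_hash="%T"'
```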
Other necessary arguments were:
- --all
Generates the log using all the refs in the refs/ directory. This argument is necessary to retrieve commits from all branches.
- --no-merges
Merge commits are ignored, since those commits have multiple parent hashes.
- --skip=<number>
Skips the given number of commits from the log. We use this argument together with a Splunk search that counts the already-splunked commits, so they are skipped when retrieving new data. In a shell script, the following line retrieves that count:
NUMBER_OF_COMMITS_TO_SKIP=`splunk search "index=splunkgit | stats dc(commit_hash) as commitCount" -auth admin:changeme -app Splunkgit | grep -o -P '[0-9]+'`
Retrieving the git data
As you would expect, there is no way to execute git log on the remote repo itself. The way to go was to make a local clone of the repository:
git clone --mirror <Repo address> <Directory to save repo>
The --mirror argument ensures that all new refs are fetched from the remote. It also implies the --bare argument, meaning no working tree is created for this repository.
The clone operation is only performed when there isn't a local repository present. Subsequent executions of our git log analyzing script start with a git fetch to receive all the new data.
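The clone-or-fetch decision can be sketched like this (a self-contained toy: the "remote" here is just a local bare repository, and the paths are placeholders):

```shell
set -e
# Stand-in remote so the sketch runs without network access
remote=$(mktemp -d)/remote.git
git init -q --bare "$remote"

mirror=$(mktemp -d)/splunkgit-mirror.git

# Clone on the first run, fetch on every run after that
if [ -d "$mirror" ]; then
    git --git-dir="$mirror" fetch -q --all
else
    git clone -q --mirror "$remote" "$mirror"
fi
```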
Splunk MAX_DAYS_AGO property
Splunk has a property called MAX_DAYS_AGO, which specifies the oldest date to accept when indexing data. Dates older than MAX_DAYS_AGO will be shown/searched using the current date. To avoid this, we put the following in the props.conf file:
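The entry looks something like this (a sketch; the exact stanza name depends on how the app names its sources and sourcetypes — here we assume they all start with git):

```ini
[git*]
MAX_DAYS_AGO = 10000
```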
This tells splunk that all data whose name starts with git can be as old as 10000 days, which is roughly 27 years.