SMail: Splunking Your Inbox


Google sent me a nice message to start the year – “Your inbox is reaching its limit”.

Looking at my GMail inbox I have well over 70k emails, taking up just under 15GB of space. I’m interested in how this number is made up – who emails me the most, who I email, what time I’m most productive, etc.

I decided to download my GMail archive using Google Takeout to analyse the data. Here’s how I did it.

Download Your Inbox

Google Takeout

First, use Google Takeout to download your GMail mailbox. Depending on the number of emails you have accumulated, this might take a while. My ~15GB took about an hour.

Once complete, Google will give you a .zip file. Download and unzip it. You should see a file named something like “<my_gmail_inbox>.mbox”.
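
Before uploading, it can be worth sanity-checking the export. The sketch below is not part of the original workflow: it assumes Python 3 is available and uses the standard-library mailbox module to count the messages in the file. The function name count_messages and the example path are made up for illustration.

```python
import mailbox


def count_messages(mbox_path):
    """Return the number of messages found in an mbox file."""
    # create=False: fail instead of silently creating an empty mailbox
    # if the path is wrong.
    box = mailbox.mbox(str(mbox_path), create=False)
    try:
        return len(box)
    finally:
        box.close()


# Hypothetical usage; replace the path with your own export:
# print(count_messages("my_gmail_inbox.mbox"))
```

The count should be in the same ballpark as the number of events Splunk reports once the data is indexed. The module scans the whole file to find message boundaries, so a multi-gigabyte archive may take a while.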

Upload the .mbox file to Splunk

If you’re confident with editing props.conf directly, ignore the next paragraph.

Using the file uploader in Splunk, select your .mbox file using the option “Preview Data Before Indexing”. We will use the data preview to teach Splunk what .mbox events look like so that they are indexed correctly.

Using the “Advanced Mode” tab you can create the props.conf in the GUI. To get data indexing correctly, I suggest a props.conf structure similar to the following:

# Remove the [gmail-mbox] line if using the Splunk GUI "Advanced Mode" tab
[gmail-mbox]
MAX_EVENTS = 100000
BREAK_ONLY_BEFORE = From\s.+?@
MAX_TIMESTAMP_LOOKAHEAD = 150
NO_BINARY_CHECK = 1
TRUNCATE = 100000
MAX_DAYS_AGO=3652

Let me describe what is being set here:

  • MAX_EVENTS = Specifies the maximum number of input lines to add to any event. Example=“100000”. Default=“256”. Some of my messages were over 1,000 lines long, so I set this to 100 times that figure to leave plenty of headroom.
  • BREAK_ONLY_BEFORE = Splunk creates a new event when it encounters a line that matches this regular expression. Example=“From\s.+?@”. This breaks the GMail events in the correct place (before the line starting “From xxxx@…”).
  • TRUNCATE = The maximum line length (in bytes). Example=“100000”. Default=“10000”. The 100000 used in this example is unlikely to be exceeded unless a really messy message is found.
  • MAX_DAYS_AGO = Specifies the maximum number of days in the past, from the current date, that an extracted date can be valid. Example=“3652”. Default=“2000”. Given that I had messages older than 5 years (1826 days), I increased this to 10 years (3652 days).

More information can be found in the docs here. You should read them :)
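
To get a feel for what BREAK_ONLY_BEFORE does with this regular expression, here is a rough Python model of the event breaking. It is a simplification for illustration only, not how Splunk implements line merging, and the function name split_events is invented here.

```python
import re

# The same pattern used for BREAK_ONLY_BEFORE in the props.conf above.
BREAK_ONLY_BEFORE = re.compile(r"From\s.+?@")


def split_events(text, pattern=BREAK_ONLY_BEFORE):
    """Group lines into events, starting a new event at each matching line."""
    events, current = [], []
    for line in text.splitlines():
        # A matching line closes the event in progress and starts a new one.
        if pattern.search(line) and current:
            events.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        events.append("\n".join(current))
    return events
```

In this model a header line such as “From: Alice <alice@example.com>” does not trigger a break, because the pattern needs whitespace straight after “From”. The pattern is unanchored, though, so a body line such as “>From the team at help@example.com” would also match. Passing a pattern that starts with ^ is one possible way to tighten it, though I have not verified that against a full mailbox.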

Indexing the data

[Screenshot: Splunk GMail]

Now all you need to do is set this input as a new sourcetype (in the props.conf above I’ve used “gmail-mbox”) and then upload the file into Splunk.

A simple search for “sourcetype=gmail-mbox” should show all your events indexed and broken apart nicely.

As you can see from the screenshot above, the events can vary quite drastically in length, e.g. from a 21-line event to an 821-line event. I have a number of events which are thousands of lines long (mainly the result of email bodies filled with HTML).

The histogram returned immediately gives us a good indication of month-on-month message volume. Note, this search shows both sent and received messages from your GMail account.

Field extraction

You’ll see that fields will not have been extracted correctly from your events, so we need to teach Splunk what this new .mbox format looks like.

For this first exercise I am only interested in the “labels”, “to”, and “from” fields. Here are the extractions I used in my props.conf:

# Remove the [gmail-mbox] line if using the Splunk GUI "Advanced Mode" tab
[gmail-mbox]
# ... settings from earlier ...
EXTRACT-gmail-mbox-labels = (?m)^X-Gmail-Labels\s*:\s*(?P<gmail_labels>[^\r\n]+)
EXTRACT-gmail-mbox-from = (?m)^From\s*:.*?(?P<gmail_from>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)
EXTRACT-gmail-mbox-to = (?m)^To\s*:.*?(?P<gmail_to>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)

As you can see my regular expression skills are weak and I’m sure you can improve upon these extractions. I needed a fair bit of help just to get this far.
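
One way to iterate on extractions like these without re-indexing is to try them against a sample event in Python first. The sketch below is only a suggestion: the patterns are my own tightened variants rather than anything official, the sample addresses are invented, and Python's re module is close to, but not identical to, the PCRE engine Splunk uses.

```python
import re

# Anchoring to the start of a line avoids matching headers such as
# "Delivered-To:"; in props.conf the equivalent would need a (?m) prefix.
LABELS_RE = re.compile(
    r"^X-Gmail-Labels\s*:\s*(?P<gmail_labels>[^\r\n]+)", re.MULTILINE)
FROM_RE = re.compile(
    r"^From\s*:.*?(?P<gmail_from>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)", re.MULTILINE)
TO_RE = re.compile(
    r"^To\s*:.*?(?P<gmail_to>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)", re.MULTILINE)


def extract_fields(event):
    """Return the named fields that the three patterns find in one event."""
    fields = {}
    for pattern in (LABELS_RE, FROM_RE, TO_RE):
        match = pattern.search(event)
        if match:
            fields.update(match.groupdict())
    return fields
```

For a header like “From: Alice Smith <alice.smith@example.co.uk>”, a pattern that only allows word characters around the @ and a two-part domain would capture smith@example.co, whereas the variants above keep the whole address. Only the first address on each line is captured, so multi-recipient To lines are still under-counted.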

If anyone wants to share how they would pull out fields from GMail’s .mbox file format (or similar email format for that matter), join the conversation over on Splunk Answers or leave a comment on the post. Lots of kudos on offer :)

Search on

[Screenshot: Splunk Gmail Top Senders]

Here are some example searches to get you started.

Number of emails you’ve sent:

sourcetype="gmail-mbox" gmail_labels=*Sent* | stats count

People you’ve received the most emails from:

sourcetype="gmail-mbox" NOT gmail_from=my@email.com | top limit=10 gmail_from

People you’ve sent the most emails to:

sourcetype="gmail-mbox" gmail_from=my@email.com | top limit=10 gmail_to
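
If you want to cross-check the top senders result outside Splunk, a short Python script can produce a comparable count straight from the .mbox file. This is a sketch under assumptions: it uses the standard-library mailbox module, the function name top_senders is invented here, and its numbers may differ slightly from Splunk's because it parses the headers properly rather than using the regex extractions.

```python
import mailbox
from collections import Counter
from email.utils import getaddresses


def top_senders(mbox_path, me, limit=10):
    """Count From addresses in an mbox file, ignoring your own address."""
    counts = Counter()
    box = mailbox.mbox(str(mbox_path), create=False)
    try:
        for message in box:
            # str() copes with encoded (non-ASCII) header values.
            values = [str(v) for v in message.get_all("From", [])]
            for _name, address in getaddresses(values):
                address = address.lower()
                if address and address != me.lower():
                    counts[address] += 1
    finally:
        box.close()
    return counts.most_common(limit)


# Hypothetical usage:
# for address, count in top_senders("my_gmail_inbox.mbox", "my@email.com"):
#     print(count, address)
```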

To do

  • More interesting queries
  • Fine tune existing extractions
  • Add more extractions

Is it similar to MIT immerse?
Thanks
Rajnish

Rajnish
January 7, 2015

It could be.

In this post I’ve just covered getting data into Splunk. From there you can use searches to turn the data into visualisations. There are some great Splunk apps (e.g. AfterGlow https://apps.splunk.com/app/277/) that will get you up and running fairly quickly with network graphs similar to MIT Immerse.

January 8, 2015

Any advice on how to do something similar with Microsoft Outlook? Very curious

Jeff Bridwell
January 8, 2015

Hey Jeff – another Splunker, Steve Gailey, has actually done some work with PST files. As luck would have it he wrote a blog too :) http://gailey.org.uk/post/Splunking-Email

January 8, 2015

It happened that my gmail mailbox is 1.6G in size. After downloading the zip file is about 700MB. It exceeds the limit of 500MB imposed by splunk.

Error message:
File too large. The file selected is 1651Mb. Maximum file size is 500Mb

Any workaround? Thanks.

haans
August 18, 2015

Unfortunately that is a limitation of the browser upload (Splunk can handle data of any size).

You can ingest files into Splunk using a number of other ways to avoid this problem. Check out: http://docs.splunk.com/Documentation/Splunk/latest/Data/Configureyourinputs

Also, answers.splunk.com will help too.

August 19, 2015
