Splunking 1 million URLs

Do you love URLs? I do! They are a great way to gain insight into behaviors, catch malware, and help classify what is going on in a network.

I also have a secret: I collect them. The more I have, the happier I am! So what could be better than Splunk to analyze them?

This is the first in a series of posts on what one can do with URLs and Splunk. Please share war stories in the comments, or anything you are doing with Splunk and URLs, so I can enrich the upcoming posts.

First, you need to grab the Alexa list, which contains the top 1 million URLs in a CSV file you can download.

We add the new data source to Splunk:

blog-data1

Splunk automagically discovers the CSV type, and we can start searching for our URLs right away.

Now we need an app to parse our URLs properly. Fortunately, Splunkbase has many:

blog-app1 blog-app2

If we start looking at our data, we can run a search such as:

source="top-1m.csv.zip:./top-1m.csv" | rex field=_raw "\d+,(?<url>\S+)"

We create a field called url with our regex, and then use the lookup to parse those URLs and extract useful new fields:

blog-fields1
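To make the extraction step concrete outside of Splunk, here is a minimal Python sketch of the same regex applied to rows in the rank,url layout. The function name and the sample rows are assumptions made for illustration, and this is not what the lookup itself runs. Note that Python spells named groups as (?P<name>...) where rex uses (?<name>...).

```python
import re

# Same pattern as the rex command above, in Python's named-group syntax.
ROW = re.compile(r"\d+,(?P<url>\S+)")

def extract_url(raw_line):
    """Return the url part of a 'rank,url' row, or None if the row does not match."""
    match = ROW.search(raw_line)
    return match.group("url") if match else None

# Hypothetical sample rows in the rank,url layout the search assumes.
sample_rows = ["1,google.com", "2,facebook.com", "not a row"]
urls = [extract_url(row) for row in sample_rows]
print(urls)
```

Rows that do not match simply yield None, which mirrors events where rex leaves the url field unset.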

We can now look at the top count for domains without the attached TLD:

blog-search1

In this case Google comes out on top, with a count of 145. That means Google appears in the top 1 million most visited URLs multiple times under various TLDs, such as:

google.com, google.om, google.li, google.co.ls, google.so, google.co.uk, etc.
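The counting idea can be sketched in a few lines of Python: strip the longest known suffix from each host and count what is left. The SUFFIXES set is a tiny hypothetical stand-in for the Public Suffix List, and domain_without_tld is an illustrative helper, not the app's actual logic.

```python
from collections import Counter

# Hypothetical stand-in for the Public Suffix List; a real run would load the full list.
SUFFIXES = {"com", "om", "li", "co.ls", "so", "co.uk"}

def domain_without_tld(host):
    """Return the label just left of the longest known suffix, or None."""
    labels = host.lower().split(".")
    # Start with the longest candidate suffix so co.uk is tried before uk.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in SUFFIXES:
            return labels[i - 1]
    return None

# The example hosts listed above.
hosts = ["google.com", "google.om", "google.li",
         "google.co.ls", "google.so", "google.co.uk"]
counts = Counter(domain_without_tld(h) for h in hosts)
print(counts.most_common(1))
```

On the six hosts listed above, every entry collapses to the same domain, which is the effect behind the count of 145 on the full list.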

If we now look at the top TLDs, it is easy to see that com comes first:

blog-tlds1

Among the extracted fields is “url_url_type”, which can take values such as ipv6, ipv4, no_tld, unknown_tld, and mozilla_tld.

blog-type1
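As a rough illustration of how such a type field could be derived, the sketch below classifies a host into the same five labels. The rules are my reading of the label names, not the app's documented behavior, and KNOWN_SUFFIXES is a hypothetical three-entry stand-in for the Public Suffix List.

```python
import ipaddress

# Hypothetical mini suffix set standing in for the Public Suffix List.
KNOWN_SUFFIXES = {"com", "co.uk", "li"}

def classify_host(host):
    """Guess a url type label for a host (assumed rules, for illustration only)."""
    try:
        # Accept bracketed IPv6 literals such as [2001:db8::1].
        ip = ipaddress.ip_address(host.strip("[]"))
        return "ipv4" if ip.version == 4 else "ipv6"
    except ValueError:
        pass  # Not an IP literal, so treat it as a hostname.
    labels = host.lower().split(".")
    if len(labels) < 2:
        return "no_tld"
    # Check every trailing run of labels against the known suffixes.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in KNOWN_SUFFIXES:
            return "mozilla_tld"
    return "unknown_tld"

for host in ["192.0.2.1", "2001:db8::1", "localhost", "google.co.uk", "example.рф"]:
    print(host, classify_host(host))
```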

The mozilla_tld value only indicates that the TLD is present in the Public Suffix List. So whenever an entry is flagged as “unknown_tld” and also appears in Alexa's top 1 million URLs, it starts to get interesting:

blog-unknown1

This is actually a TLD romanized as rf, according to Wikipedia, and it does appear in the Mozilla Public Suffix List as follows:

// xn--p1ai (“rf”, Russian-Cyrillic) : RU

// http://www.cctld.ru/en/docs/rulesrf.php

рф

However, the two do not share the same encoding, which points to an improvement that could be made in the lookup. Adding a Unicode to Punycode conversion, perhaps?
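For what such a conversion might look like, here is a minimal sketch using Python's built-in idna codec, which converts between the Unicode and Punycode (xn--) forms of a label or a whole host. The helper names are made up for illustration, and nothing here reflects how the lookup is actually implemented.

```python
def to_ascii(name):
    """Convert a Unicode label or host to its Punycode (xn--) form."""
    return name.encode("idna").decode("ascii")

def to_unicode(name):
    """Convert a Punycode (xn--) label or host back to Unicode."""
    return name.encode("ascii").decode("idna")

print(to_ascii("рф"))
print(to_unicode("xn--p1ai"))
print(to_ascii("example.рф"))
```

Normalizing both the list entries and the incoming URLs to one of the two forms before comparing them would let рф and xn--p1ai match.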

 

Splunk offers a variety of apps, some of which can help analysts get more out of the insight that URLs provide. Happy Splunking!