Splunking 1 million URLs
Do you love URLs? I do! They are a great way to gain insight into behaviors, catch malware, and help classify what is going on in a network.
I also have a secret: I collect them. The more I have, the happier I am! So what’s better than Splunk to analyze them?
This is the first in a series of posts on what one can do with URLs and Splunk. Please share war stories in the comments, or anything you are doing with Splunk and URLs, so I can enrich the upcoming posts.
We add the new data source, the Alexa top 1 million list (top-1m.csv.zip), to Splunk:
Splunk automagically discovers the CSV type, and we can start searching for our URLs right away.
Now we need an app to parse our URLs properly. Fortunately, Splunkbase has many:
If we start looking at our data, we can run a search such as
source="top-1m.csv.zip:./top-1m.csv" | rex field=_raw "\d+,(?<url>\S+)"
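For readers who want to see what the rex pattern captures outside Splunk, here is a minimal Python sketch of the same extraction. It assumes each event looks like rank,domain, as the regex suggests; the sample rows and the function name extract_url are made up for illustration.

```python
import re

# Same pattern as the rex command, written with Python's (?P<name>...) syntax.
URL_PATTERN = re.compile(r"\d+,(?P<url>\S+)")


def extract_url(raw_line):
    """Return the url captured from a 'rank,domain' line, or None if no match."""
    match = URL_PATTERN.search(raw_line)
    return match.group("url") if match else None


# Hypothetical sample rows in the assumed rank,domain layout.
sample_rows = ["1,google.com", "2,facebook.com", "not a csv row"]
urls = [extract_url(row) for row in sample_rows]
print(urls)  # ['google.com', 'facebook.com', None]
```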
We create a field url using our regex, and then use the lookup to parse those URLs and extract useful new fields:
We can now look at the top count for domains without the attached TLD:
In this case the top result is google, with a count of 145. That means Google appears in the top 1 million most visited URLs multiple times under various TLDs, such as:
google.com, google.om, google.li, google.co.ls, google.so, google.co.uk, etc.
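The idea behind that count can be sketched in a few lines of Python: strip the TLD from each hostname, then count what is left. This is only an illustration, not how the Splunk lookup is implemented. KNOWN_SUFFIXES is a hand-picked, hypothetical subset; a real parser would rely on the full Public Suffix List.

```python
from collections import Counter

# Hypothetical, hand-picked suffixes; a real lookup would use the full Public Suffix List.
KNOWN_SUFFIXES = {"com", "om", "li", "co.ls", "so", "co.uk"}


def split_domain(hostname):
    """Split a hostname into (domain_without_tld, tld) using the longest known suffix."""
    labels = hostname.lower().split(".")
    # Walk from the longest candidate suffix to the shortest, so "co.uk" wins over "uk".
    for i in range(1, len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in KNOWN_SUFFIXES:
            return labels[i - 1], suffix
    # No known suffix: return the hostname untouched with an empty TLD.
    return hostname, ""


hosts = ["google.com", "google.om", "google.li", "google.co.ls", "google.so", "google.co.uk"]
counts = Counter(split_domain(h)[0] for h in hosts)
print(counts.most_common(1))  # [('google', 6)]
```

Matching the longest suffix first matters for entries such as co.uk, where stopping at uk would report co as the domain.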
If we now look at the top TLDs, com clearly comes out first:
Among the extracted fields is “url_url_type”, which can take values such as ipv6, ipv4, no_tld, unknown_tld, and mozilla_tld.
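To make these categories concrete, here is a rough Python sketch of how such a classification could work. It is a guess at the logic, not the app's actual implementation. PUBLIC_SUFFIXES is a tiny hypothetical stand-in for the Public Suffix List, and classify is an illustrative name.

```python
import ipaddress

# Tiny hypothetical stand-in for the Public Suffix List.
PUBLIC_SUFFIXES = {"com", "co.uk", "li"}


def classify(host):
    """Return a url_type-like label for a host (assumed logic, for illustration only)."""
    try:
        # Brackets are stripped so that "[2001:db8::1]" is accepted as IPv6.
        ip = ipaddress.ip_address(host.strip("[]"))
        return "ipv4" if ip.version == 4 else "ipv6"
    except ValueError:
        pass  # Not an IP address, so treat it as a hostname.
    labels = host.lower().rstrip(".").split(".")
    if len(labels) < 2:
        return "no_tld"
    # Any trailing run of labels found in the suffix list counts as a known TLD.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return "mozilla_tld"
    return "unknown_tld"


for host in ["192.168.0.1", "[2001:db8::1]", "localhost", "google.com", "example.notatld"]:
    print(host, classify(host))
```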
The mozilla_tld value only indicates that the TLD is present in Mozilla's Public Suffix List. So whenever an entry is flagged as “unknown_tld” and yet is in Alexa's top 1 million URLs, it starts to get interesting:
This is actually a TLD romanized as rf, according to Wikipedia, and it does appear in the Mozilla Public Suffix List, as follows:
// xn--p1ai ("rf", Russian-Cyrillic) : RU
But it is not stored there with the same encoding, which points to an improvement that could be made in the lookup. Perhaps adding a Unicode to Punycode conversion?
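As a minimal sketch of what such a conversion could look like, Python's built-in idna codec converts between the Unicode and Punycode (ASCII) forms of a hostname. This is not part of the Splunk lookup. The helper names are made up for illustration, and the built-in codec follows the older IDNA 2003 rules, so treat it as a starting point only.

```python
def to_punycode(hostname):
    """Convert a Unicode hostname to its ASCII (Punycode) form, label by label."""
    return hostname.encode("idna").decode("ascii")


def to_unicode(hostname):
    """Convert an ASCII (Punycode) hostname back to its Unicode form."""
    return hostname.encode("ascii").decode("idna")


# The Cyrillic TLD discussed above, in both encodings.
print(to_punycode("рф"))       # xn--p1ai
print(to_unicode("xn--p1ai"))  # рф
```

Normalizing both the URL and the suffix list to the same form before comparing them would let the two spellings of the same TLD match.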
Splunk offers a variety of apps, some of which can help analysts get more out of the insight URLs provide. Happy Splunking!