How to: Splunk Analytics for Hadoop on Amazon EMR.

Using Amazon EMR and Splunk Analytics for Hadoop to explore, analyze and visualize machine data

Machine data takes many forms and comes from a variety of sources: system logs, application logs, service and system metrics, sensor data, and so on. In this step-by-step guide, you will learn how to build a big data solution for fast, interactive analysis of data stored in Amazon S3 or Hadoop. This hands-on guide is useful for solution architects, data analysts and developers.

This guide walks you through how to:

  1. Set up an EMR cluster
  2. Set up a Splunk Analytics for Hadoop node
  3. Connect to data in your S3 buckets
  4. Explore, visualize and report on your data

You will need:

  1. An Amazon EMR Cluster
  2. A Splunk Analytics for Hadoop Instance
  3. Amazon S3 bucket with your data
    • Data can also be in Hadoop Distributed File System (HDFS)

Picture1

To get started, go into Amazon EMR from the AWS management console page:

Picture2

From here, you can manage your existing clusters, or create a new cluster. Click on ‘Create Cluster’:

Picture3

 

This will take you to the configuration page. Set a meaningful cluster name, enable logging (if required) to an existing Amazon S3 bucket, and set the launch mode to cluster:

Picture4

Under software configuration, choose Amazon EMR 5.x as per the following:

Picture5

Several of the included applications are not required to run Splunk Analytics for Hadoop, but they may make your environment easier to manage.

Choose the appropriate instance types, and number of instances according to your requirements:

Picture6

**Please note that Splunk recommends Hadoop nodes with 8 cores / 16 vCPU. The m3.xlarge instances were used here for demonstration only.

For security and access settings, choose those appropriate to your deployment scenario. Using the defaults here can be an appropriate option:

Picture7

Click ‘Create Cluster’.
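The console steps above can also be approximated with the AWS CLI. The following is a minimal sketch, assuming the AWS CLI is installed and configured and that the default EMR roles already exist. The helper name emr_create_cmd, the release label emr-5.0.0, the instance count, the application list and the emr-logs/ prefix are illustrative assumptions, not values taken from this guide. The helper only prints the command so that it can be reviewed before anything is launched.

```shell
# Hypothetical helper: print (do not run) an `aws emr create-cluster` command.
# Arguments: cluster name, EC2 key pair name, S3 bucket for logs.
# The release label, instance count and application list are assumptions.
emr_create_cmd() {
    cat <<EOF
aws emr create-cluster \\
  --name "$1" \\
  --release-label emr-5.0.0 \\
  --applications Name=Hadoop Name=Hue \\
  --instance-type m3.xlarge \\
  --instance-count 3 \\
  --use-default-roles \\
  --ec2-attributes KeyName=$2 \\
  --log-uri s3://$3/emr-logs/
EOF
}
```

For example, `emr_create_cmd demo-cluster my-key my-log-bucket` prints the command with those placeholder values filled in; adjust the instance type and count to your own sizing before running it.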

This process may take some time. Keep an eye on the Cluster list for status changes:

Picture8

When the cluster is deployed and ready:

Picture9

Clicking on the cluster name will provide the details of the set up:

Picture10

At this point, browse around the platform, and get familiar with the operation of the EMR cluster. Hue is a good option for managing the filesystem, and the data that will be analyzed through Splunk Analytics for Hadoop.

Configure Splunk Analytics for Hadoop on AWS AMI instance to connect to EMR Cluster

Splunk's recommended architectural approach is to install Splunk Analytics for Hadoop on a separate Amazon EC2 instance, apart from your Amazon EMR cluster. To configure this setup, we launch a Splunk 6.5 AMI from the AWS Marketplace, and then add the necessary Hadoop, Amazon S3 and Java libraries. This last step is further outlined in the Splunk docs at http://docs.splunk.com/Documentation/HadoopConnect/1.2.3/DeployHadoopConnect/HadoopCLI

To kick off, launch a new Amazon EC2 instance from the AWS Management Console:

Picture11

Search the AWS Marketplace for Splunk and select the Splunk Enterprise 6.5 AMI:

Picture12

Choose an instance size to suit your environment and requirements:

Picture13

**Please note that Splunk recommends minimum hardware specifications for a production deployment. More details at http://docs.splunk.com/Documentation/Splunk/6.5.0/Installation/Systemrequirements

From here you can choose to further customize the instance (should you want more storage, or to add custom tags), or just review and launch:

Picture14

Now, you’ll need to add the Hadoop, Amazon S3 and Java client libraries to the newly deployed Splunk AMI. To do this, first check the version of each on the Amazon EMR master node, so that the libraries on your Splunk server match. Once you have the matching versions, install them on the Splunk AMI:

Picture15

Move the downloaded archive to /usr/bin and unpack it.

In order to search the Amazon S3 data, we need to ensure we have access to the S3 toolset. Add the following line to the file /usr/bin/hadoop/etc/hadoop/hadoop-env.sh:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*
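If you script this step, it helps to make the change idempotent so that re-running the setup does not add the line twice. The following is a minimal sketch; the helper name add_tools_classpath and the HADOOP_ENV_FILE variable are hypothetical, and the default path is the one used in this guide.

```shell
# Path from this guide; set HADOOP_ENV_FILE beforehand if your layout differs.
HADOOP_ENV_FILE="${HADOOP_ENV_FILE:-/usr/bin/hadoop/etc/hadoop/hadoop-env.sh}"

# Single quotes keep $HADOOP_CLASSPATH and $HADOOP_HOME unexpanded, so the
# literal text is what ends up in hadoop-env.sh.
CLASSPATH_LINE='export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*'

# Hypothetical helper: append the line to the given file only if an
# identical line is not already present.
add_tools_classpath() {
    file="$1"
    if ! grep -qxF "$CLASSPATH_LINE" "$file" 2>/dev/null; then
        printf '%s\n' "$CLASSPATH_LINE" >> "$file"
    fi
}
```

Run `add_tools_classpath "$HADOOP_ENV_FILE"` as a user with write access to that file.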

Finally, we need to set up the necessary authentication to access Amazon S3 via our new virtual index connection. You’ll need an access key ID and secret access key from your AWS Identity and Access Management (IAM) setup. In this instance, we have set up these credentials for an individual AWS user:

Picture16

Ensure that when you create the access key, you record the details. You then need to include these in the file located at /usr/bin/hadoop/etc/hadoop/hdfs-site.xml. Include the following within the <configuration> tag:

<property>
   <name>fs.s3.awsAccessKeyId</name>
   <value>xxxx</value>
</property>
<property>
   <name>fs.s3.awsSecretAccessKey</name>
   <value>xxxx</value>
</property>
<property>
   <name>fs.s3n.awsAccessKeyId</name>
   <value>xxxx</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>xxxx</value>
</property>

You need to include the s3n keys, as s3n is the mechanism we will use to connect to the Amazon S3 dataset.
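Before moving on to Splunk, it can be worth confirming that the Hadoop client on the Splunk instance can reach S3 with these credentials. The following is a minimal sketch; the helper names s3n_uri and check_s3n_access are hypothetical, and the bucket name you pass in is your own.

```shell
# Hypothetical helper: build an s3n:// URI from a bucket name and an
# optional key prefix (one leading slash on the prefix is tolerated).
s3n_uri() {
    bucket="$1"
    prefix="${2:-}"
    printf 's3n://%s/%s\n' "$bucket" "${prefix#/}"
}

# Hypothetical helper: list the location through the Hadoop client, which
# should pick up the fs.s3n.* credentials configured above.
check_s3n_access() {
    hadoop fs -ls "$(s3n_uri "$1" "${2:-}")"
}
```

For example, `check_s3n_access my-bucket logs` lists s3n://my-bucket/logs. An authentication or class-not-found error at this point suggests rechecking the keys or the HADOOP_CLASSPATH line.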

Create data to analyze with Splunk Analytics for Hadoop

We have multiple options for connecting to data for investigation within Splunk Analytics for Hadoop. In this guide, we will explore adding files to HDFS via Hue, and connecting to an existing Amazon S3 bucket to explore data.

From the AWS Management Console, go into Amazon S3, and create a new bucket:

Picture17

Give the bucket a meaningful name, and specify the region in which you would like it to exist:

Picture18

 

Click create, and add some files to this new bucket as appropriate. You can choose to add the files to the top level, or create a directory structure:

Picture20

The files or folders that you create within the Amazon S3 bucket need permissions that allow the Splunk Analytics for Hadoop user to connect and view them. For initial testing you can allow ‘everyone’ read access, but this makes the data publicly readable, so restrict it to the appropriate users or roles as soon as testing is complete.

Set up Splunk Analytics for Hadoop for data analysis

To proceed, first you’ll need to grab some parameters from the Hadoop nodes:

Collect the Hadoop and YARN variables:

  1. Java home: run ‘which java’ (here, /usr/bin/java)
  2. Hadoop home: run ‘which hadoop’ (here, /usr/bin/hadoop)
  3. Hadoop version: run ‘hadoop version’ (here, Hadoop 2.7.2-amzn-3)
  4. Name node port: in a browser, go to http://masternodeaddress:50070 (or click on HDFS name node in the EMR management console screen)
  5. YARN resource manager scheduler address: in a browser, go to http://masternodeaddress:8088/conf (or click on ‘resource manager’ in the EMR management console screen) and look for ‘yarn.resourcemanager.scheduler.address’ (here, x.x.x:8030)
  6. YARN resource manager address: in a browser, go to http://masternodeaddress:8088/conf (or click on ‘resource manager’ in the EMR management console screen) and look for ‘yarn.resourcemanager.address’ (here, x.x.x:8050)
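The two YARN addresses can also be read from the command line instead of a browser. The following is a minimal sketch, assuming curl is available and that the /conf page returns XML in which each <name> element is immediately followed by its <value> element; the helper names conf_value and show_yarn_addresses are hypothetical.

```shell
# Hypothetical helper: read Hadoop configuration XML on stdin and print the
# <value> of the property whose <name> matches the first argument.
# Assumes <name> and <value> are adjacent in the XML.
conf_value() {
    grep -o "<name>$1</name><value>[^<]*</value>" \
        | sed 's/.*<value>\(.*\)<\/value>/\1/'
}

# Hypothetical helper: fetch the resource manager configuration from the
# master node and print the two addresses needed for the provider setup.
show_yarn_addresses() {
    master="$1"
    conf="$(curl -s "http://$master:8088/conf")"
    for key in yarn.resourcemanager.scheduler.address yarn.resourcemanager.address; do
        printf '%s = %s\n' "$key" "$(printf '%s' "$conf" | conf_value "$key")"
    done
}
```

Call `show_yarn_addresses masternodeaddress` from a host that can reach port 8088 on the master node.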

Now, we need to verify that the name node is correct. You can do this by executing this command:

hadoop fs -ls hdfs://masternodeaddress:8020/user/root/data

Now we can configure our Virtual Provider in Splunk. To do this, go to settings, and then Virtual Indexes:

Picture22

Then choose to create a new provider:

Picture23

Using the parameters that we gathered earlier, fill this section out:

Picture24

Picture25

 

Save this setup, and go to set up a new Virtual Index:

Picture26

Here you can specify the S3 bucket that was created:

Picture27

Ensure that you use the s3n prefix here.
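The provider and virtual index created through the UI are stored as stanzas in Splunk's indexes.conf, so the same setup can be expressed as configuration. The following is a minimal sketch that prints such stanzas from the parameters gathered earlier. The helper name render_vix_conf, the stanza names emr-provider and emr_s3_vix, and the working directory /user/splunk/workdir are illustrative assumptions; compare the output against what the UI actually writes in your environment before relying on it.

```shell
# Hypothetical helper: print provider and virtual index stanzas.
# Arguments: java_home hadoop_home namenode_host resourcemanager_host bucket
# Ports 8020, 8050 and 8030 are the ones used in this guide.
render_vix_conf() {
    cat <<EOF
[provider:emr-provider]
vix.family = hadoop
vix.env.JAVA_HOME = $1
vix.env.HADOOP_HOME = $2
vix.fs.default.name = hdfs://$3:8020
vix.splunk.home.hdfs = /user/splunk/workdir
vix.mapreduce.framework.name = yarn
vix.yarn.resourcemanager.address = $4:8050
vix.yarn.resourcemanager.scheduler.address = $4:8030

[emr_s3_vix]
vix.provider = emr-provider
vix.input.1.path = s3n://$5/...
EOF
}
```

The trailing `...` in the input path is literal: it is how a virtual index path is marked as recursive, so it should not be replaced with a directory name.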

Save this set up, and you should now be able to search the data within Amazon S3 (or HDFS) using Splunk Analytics for Hadoop!

Click search on the virtual index config:

Picture29

This takes you to the Splunk search interface, where you should see something like the following:

Picture30

**Please note: This guide is an example approach outlining a functional Splunk Analytics for Hadoop environment running on AWS EMR. Please talk to your local Splunk team to determine the best architecture for you.