09 Flume

What is FLUME in Hadoop?

Apache Flume is a system for moving large volumes of streaming data into HDFS. A common use case is collecting log data from web server log files and aggregating it in HDFS for analysis.

Flume supports multiple sources like –


Flume Architecture

A Flume agent is a JVM process with three components (Flume source, Flume channel, and Flume sink) through which events propagate after being initiated at an external source.

[Figure: Flume Architecture]

  1. In the above diagram, the events generated by an external source (a web server) are consumed by the Flume source. The external source sends events to the Flume source in a format that the Flume source recognizes.
  2. The Flume source receives an event and stores it in one or more channels. A channel acts as a store that keeps the event until it is consumed by the Flume sink. The channel may use the local file system to store these events.
  3. The Flume sink removes the event from the channel and stores it in an external repository such as HDFS. When there are multiple Flume agents, the sink instead forwards the event to the Flume source of the next agent in the flow.
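
The three steps above can be pictured with a small shell sketch. It is only a toy model and not Flume itself: a plain file stands in for the channel, and the function names source_put and sink_drain, the file names, and the sample events are all hypothetical.

```shell
# Toy model of a single Flume agent (illustration only, not Flume code).
work_dir=$(mktemp -d)
channel="$work_dir/channel.queue"      # stands in for a file-backed channel
repository="$work_dir/repository.log"  # stands in for the external repository (e.g. HDFS)
: > "$channel"
: > "$repository"

# Source: receives an event and stores it in the channel.
source_put() {
  printf '%s\n' "$1" >> "$channel"
}

# Sink: copies every buffered event to the repository, then empties the channel.
sink_drain() {
  while IFS= read -r event; do
    printf '%s\n' "$event" >> "$repository"
  done < "$channel"
  : > "$channel"
}

# Two events arrive from the external source (web server log lines, say).
source_put 'GET /index.html 200'
source_put 'GET /missing.html 404'

# Events stay in the channel until the sink consumes them.
buffered=$(wc -l < "$channel" | tr -d ' ')
sink_drain
delivered=$(wc -l < "$repository" | tr -d ' ')
echo "buffered before drain: $buffered, delivered: $delivered"
```

The point of the sketch is the hand-off: the source never writes to the repository directly, and the channel holds events until the sink has taken them.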

Some Important features of FLUME

Flume, library and source code setup

Before starting, make sure Hadoop is installed. Switch to the user ‘hduser’ (the ID used for the Hadoop configuration in this tutorial; use whichever user ID you set up during your own Hadoop configuration).

Step 1) Create a new directory with the name ‘FlumeTutorial’

sudo mkdir FlumeTutorial

  1. Give it read, write, and execute permissions

sudo chmod -R 777 FlumeTutorial

  2. Copy the files MyTwitterSource.java and MyTwitterSourceForFlume.java into this directory.

Download Input Files From Here

Check the permissions of these files and, if read permission is missing, grant it.
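
One possible way to do this check from the shell is sketched below. It runs against a temporary demo directory (a hypothetical stand-in for FlumeTutorial) so that it can be tried safely; the variable name demo_dir is an assumption, not part of the tutorial.

```shell
# Demo directory standing in for FlumeTutorial (hypothetical temporary location).
demo_dir=$(mktemp -d)
touch "$demo_dir/MyTwitterSource.java" "$demo_dir/MyTwitterSourceForFlume.java"
chmod a-r "$demo_dir/MyTwitterSource.java"   # simulate a file with read permission missing

for src in "$demo_dir"/*.java; do
  # Show the current mode so the permissions can be inspected.
  ls -l "$src"
  # [ -r FILE ] is true when the current user can read FILE.
  # Note: root can read every file, so this test is only meaningful for a regular user.
  if [ ! -r "$src" ]; then
    chmod a+r "$src"
    echo "granted read permission on $src"
  fi
done
```

For the real files, replace demo_dir with the FlumeTutorial directory and prefix chmod with sudo if the files are owned by another user.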


Step 2) Download Apache Flume from https://flume.apache.org/download.html

Apache Flume 1.4.0 has been used in this tutorial.


Step 3) Copy the downloaded tarball to a directory of your choice and extract its contents using the following command

sudo tar -xvf apache-flume-1.4.0-bin.tar.gz

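This command creates a new directory named apache-flume-1.4.0-bin and extracts the files into it. This directory is referred to as <Flume installation directory> in the rest of the article; commands that show an absolute path assume it is /usr/local/apache-flume-1.4.0-bin.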

Step 4) Flume library setup

Copy twitter4j-core-4.0.1.jar, flume-ng-configuration-1.4.0.jar, flume-ng-core-1.4.0.jar, and flume-ng-sdk-1.4.0.jar to

<Flume installation directory>/lib/

One or more of the copied JARs may have execute permission set, which can cause problems when compiling the code. If so, revoke execute permission on those JARs.

In my case, twitter4j-core-4.0.1.jar had execute permission. I revoked it as below:

sudo chmod -x twitter4j-core-4.0.1.jar

After this, give read permission on twitter4j-core-4.0.1.jar to all users.

sudo chmod a+r /usr/local/apache-flume-1.4.0-bin/lib/twitter4j-core-4.0.1.jar
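
When several JARs are involved, the same two permission changes can be applied in a loop. The sketch below is one way to do it, shown on a temporary demo directory with empty placeholder files; lib_dir is a hypothetical variable that would point at the Flume lib directory in practice.

```shell
# Demo lib directory (hypothetical); in the tutorial this would be the Flume lib directory.
lib_dir=$(mktemp -d)
touch "$lib_dir/twitter4j-core-4.0.1.jar" "$lib_dir/flume-ng-core-1.4.0.jar"
chmod a+x "$lib_dir/twitter4j-core-4.0.1.jar"   # simulate a JAR that arrived with execute permission

for jar in "$lib_dir"/*.jar; do
  if [ -x "$jar" ]; then
    chmod a-x "$jar"      # revoke execute permission
    echo "revoked execute permission on $jar"
  fi
  chmod a+r "$jar"        # make sure every user can read the JAR
done
```

Against the real installation directory the chmod calls would most likely need sudo, as in the commands above.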

Please note that I have downloaded-

- twitter4j-core-4.0.1.jar from http://mvnrepository.com/artifact/org.twitter4j/twitter4j-core

- All Flume JARs, i.e., flume-ng-*-1.4.0.jar, from http://mvnrepository.com/artifact/org.apache.flume

Load data from Twitter using Flume

Step 1) Go to the directory containing the source code files.

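Step 2) Set CLASSPATH to contain <Flume installation directory>/lib/* and ~/FlumeTutorial/flume/mytwittersource/* ($HOME is used in the command because the shell does not expand ~ inside double quotes)

export CLASSPATH="/usr/local/apache-flume-1.4.0-bin/lib/*:$HOME/FlumeTutorial/flume/mytwittersource/*"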
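
The quoting details of this step are easy to get wrong, so here is a small sketch that builds the same value from two variables and prints it. FLUME_HOME and SRC_DIR are hypothetical variable names, and the paths are the ones assumed elsewhere in this tutorial.

```shell
# Assumed locations, taken from the paths used in this tutorial.
FLUME_HOME=/usr/local/apache-flume-1.4.0-bin          # hypothetical variable name
SRC_DIR="$HOME/FlumeTutorial/flume/mytwittersource"   # hypothetical variable name

# Inside double quotes the shell expands $HOME but leaves ~ and * untouched;
# the Java tools expand a trailing /* themselves to the JAR files in that directory.
export CLASSPATH="$FLUME_HOME/lib/*:$SRC_DIR/*"
echo "$CLASSPATH"

# For comparison: a quoted tilde stays a literal ~ character.
literal="~/FlumeTutorial"
echo "$literal"
```

Printing the variable before compiling is a quick way to confirm that both entries end in /* and that no literal ~ is left in the value.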


Step 3) Compile the source code using the command:

javac -d . MyTwitterSourceForFlume.java MyTwitterSource.java

Step 4) Create a JAR

First, create a Manifest.txt file using a text editor of your choice and add the line below to it:

Main-Class: flume.mytwittersource.MyTwitterSourceForFlume

Here, flume.mytwittersource.MyTwitterSourceForFlume is the name of the main class. Note that you must press Enter at the end of this line so that the line ends with a newline.

Now, create JAR ‘MyTwitterSourceForFlume.jar’ as-

jar cfm MyTwitterSourceForFlume.jar Manifest.txt flume/mytwittersource/*.class
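
As an alternative to a text editor, the manifest can be written from the shell, which makes the required trailing newline explicit. This is a minimal sketch; build_dir is a hypothetical scratch directory, and in practice Manifest.txt would be created next to the compiled classes.

```shell
build_dir=$(mktemp -d)   # hypothetical scratch directory
manifest="$build_dir/Manifest.txt"

# printf writes the Main-Class line followed by an explicit newline.
printf 'Main-Class: %s\n' 'flume.mytwittersource.MyTwitterSourceForFlume' > "$manifest"

# For a non-empty file whose last byte is a newline, this yields an empty string,
# because command substitution strips trailing newlines.
last_byte=$(tail -c 1 "$manifest")
if [ -s "$manifest" ] && [ -z "$last_byte" ]; then
  echo "Manifest.txt ends with a newline"
else
  echo "Manifest.txt is missing the final newline" >&2
fi
cat "$manifest"
```

The check matters because the jar tool may not parse the last manifest line correctly when it does not end with a newline.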


Step 5) Copy this JAR to <Flume installation directory>/lib/

sudo cp MyTwitterSourceForFlume.jar <Flume installation directory>/lib/

Step 6) Go to the configuration directory of Flume, <Flume installation directory>/conf

If flume.conf does not exist, then copy flume-conf.properties.template and rename it to flume.conf

sudo cp flume-conf.properties.template flume.conf


If flume-env.sh does not exist, then copy flume-env.sh.template and rename it to flume-env.sh

sudo cp flume-env.sh.template flume-env.sh
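
The "copy only if it does not exist" condition in this step can be expressed directly in the shell. The sketch below demonstrates the idea on a temporary directory with empty template files; conf_dir and the helper copy_if_missing are hypothetical names introduced for illustration.

```shell
conf_dir=$(mktemp -d)   # hypothetical stand-in for the Flume conf directory
touch "$conf_dir/flume-conf.properties.template" "$conf_dir/flume-env.sh.template"

# copy_if_missing TEMPLATE TARGET: copy only when TARGET does not exist yet,
# so an already edited configuration file is never overwritten.
copy_if_missing() {
  if [ -e "$2" ]; then
    echo "keeping existing $2"
  else
    cp "$1" "$2"
    echo "created $2"
  fi
}

copy_if_missing "$conf_dir/flume-conf.properties.template" "$conf_dir/flume.conf"
copy_if_missing "$conf_dir/flume-env.sh.template" "$conf_dir/flume-env.sh"

# Running it again leaves local edits in place.
echo '# local edit' >> "$conf_dir/flume.conf"
copy_if_missing "$conf_dir/flume-conf.properties.template" "$conf_dir/flume.conf"
```

This guards against wiping out a flume.conf that already contains OAuth settings.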


Creating a Twitter Application

Step 1) Create a Twitter application by signing in to https://developer.twitter.com/


Step 2) Go to ‘My applications’ (this option appears in the drop-down menu when the ‘Egg’ button at the top right corner is clicked)

Step 3) Create a new application by clicking ‘Create New App’

Step 4) Fill in the application details by specifying the application's name, description, and website. You may refer to the notes given underneath each input box.

Step 5) Scroll down the page, accept the terms by marking ‘Yes, I agree’, and click the button ‘Create your Twitter application’

Step 6) In the window of the newly created application, go to the ‘API Keys’ tab, scroll down the page, and click the button ‘Create my access token’

Step 7) Refresh the page.

Step 8) Click on ‘Test OAuth’. This displays the OAuth settings of the application.

Step 9) Modify ‘flume.conf’ using these OAuth settings. Steps to modify ‘flume.conf’ are given below.

We need to copy the consumer key, consumer secret, access token, and access token secret into ‘flume.conf’.

Note: These values belong to the user and hence are confidential, so should not be shared.

Modify ‘flume.conf’ File

Step 1) Open ‘flume.conf’ in write mode and set values for the parameters below:

sudo gedit flume.conf

Copy the contents below into it:

MyTwitAgent.sources = Twitter
MyTwitAgent.channels = MemChannel
MyTwitAgent.sinks = HDFS
MyTwitAgent.sources.Twitter.type = flume.mytwittersource.MyTwitterSourceForFlume
MyTwitAgent.sources.Twitter.channels = MemChannel
MyTwitAgent.sources.Twitter.consumerKey =
MyTwitAgent.sources.Twitter.consumerSecret =
MyTwitAgent.sources.Twitter.accessToken =
MyTwitAgent.sources.Twitter.accessTokenSecret =
MyTwitAgent.sources.Twitter.keywords = guru99
MyTwitAgent.sinks.HDFS.channel = MemChannel
MyTwitAgent.sinks.HDFS.type = hdfs
MyTwitAgent.sinks.HDFS.hdfs.path = hdfs://localhost:54310/user/hduser/flume/tweets/
MyTwitAgent.sinks.HDFS.hdfs.fileType = DataStream
MyTwitAgent.sinks.HDFS.hdfs.writeFormat = Text
MyTwitAgent.sinks.HDFS.hdfs.batchSize = 1000
MyTwitAgent.sinks.HDFS.hdfs.rollSize = 0
MyTwitAgent.sinks.HDFS.hdfs.rollCount = 10000
MyTwitAgent.channels.MemChannel.type = memory
MyTwitAgent.channels.MemChannel.capacity = 10000
MyTwitAgent.channels.MemChannel.transactionCapacity = 1000
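
Since the four OAuth properties are empty in the contents above and must be filled in by hand, a quick check before starting Flume can catch a forgotten value. The following is a rough sketch using awk on a small sample file; conf_file, the helper empty_oauth_keys, and the dummy values are hypothetical, and the real check would point at your own flume.conf.

```shell
conf_file=$(mktemp)   # hypothetical sample file; point this at the real flume.conf instead
cat > "$conf_file" <<'EOF'
MyTwitAgent.sources.Twitter.consumerKey = dummy-key
MyTwitAgent.sources.Twitter.consumerSecret =
MyTwitAgent.sources.Twitter.accessToken = dummy-token
MyTwitAgent.sources.Twitter.accessTokenSecret =
MyTwitAgent.sources.Twitter.keywords = guru99
EOF

# Print the name of every OAuth property whose value is empty.
empty_oauth_keys() {
  awk '
    /^MyTwitAgent\.sources\.Twitter\.(consumerKey|consumerSecret|accessToken|accessTokenSecret)[ \t]*=/ {
      key = $0
      sub(/[ \t]*=.*$/, "", key)         # keep the text before the first equals sign
      value = $0
      sub(/^[^=]*=[ \t]*/, "", value)    # keep the text after the first equals sign
      sub(/[ \t]+$/, "", value)          # drop trailing whitespace
      if (value == "") print key
    }
  ' "$1"
}

missing=$(empty_oauth_keys "$conf_file")
missing_count=$(printf '%s\n' "$missing" | grep -c .)
echo "OAuth properties still empty: $missing_count"
printf '%s\n' "$missing"
```

In the sample file two of the four properties are left empty on purpose, so the sketch reports those two names.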


Step 2) Also, set MyTwitAgent.sinks.HDFS.hdfs.path as below:

MyTwitAgent.sinks.HDFS.hdfs.path = hdfs://<host name>:<port>/<HDFS home directory>/flume/tweets/

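To find the host name and port, see the value of the parameter ‘fs.defaultFS’ set in $HADOOP_HOME/etc/hadoop/core-site.xml. The sample configuration above uses /user/hduser as the HDFS home directory.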
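
As a rough illustration of reading that value, the sketch below pulls fs.defaultFS out of a sample core-site.xml with awk and assembles the sink path from it. The sample file contents, the variable names core_site, default_fs, and hdfs_home, and the line-oriented parsing are all assumptions; a real core-site.xml may be laid out differently.

```shell
core_site=$(mktemp)   # hypothetical sample; the real file is $HADOOP_HOME/etc/hadoop/core-site.xml
cat > "$core_site" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
EOF

# Print the <value> that follows the fs.defaultFS <name> element.
# This is a line-oriented sketch, not a real XML parser.
default_fs=$(awk '
  /<name>fs\.defaultFS<\/name>/ { found = 1 }
  found && /<value>/ {
    gsub(/.*<value>|<\/value>.*/, "")
    print
    exit
  }
' "$core_site")

hdfs_home=/user/hduser   # HDFS home directory used in the sample flume.conf
echo "MyTwitAgent.sinks.HDFS.hdfs.path = ${default_fs}${hdfs_home}/flume/tweets/"
```

If the Hadoop command-line tools are on the PATH, running hdfs getconf -confKey fs.defaultFS should report the same setting without any parsing.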


Step 3) To flush the data to HDFS as and when it arrives, delete the entry below if it exists:

MyTwitAgent.sinks.HDFS.hdfs.rollInterval = 600

Example: Streaming Twitter Data using Flume

Step 1) Open ‘flume-env.sh’ in write mode and set values for the parameters below:

JAVA_HOME=<Java installation directory>

FLUME_CLASSPATH="<Flume installation directory>/lib/MyTwitterSourceForFlume.jar"

Step 2) Start Hadoop

$HADOOP_HOME/sbin/start-dfs.sh

$HADOOP_HOME/sbin/start-yarn.sh

Step 3) Two of the JAR files from the Flume tarball are not compatible with Hadoop 2.2.0, so follow the steps below to make Flume compatible with Hadoop 2.2.0.

a. Move protobuf-java-2.4.1.jar out of ‘<Flume installation directory>/lib’.

Go to ‘<Flume installation directory>/lib’

cd <Flume installation directory>/lib

sudo mv protobuf-java-2.4.1.jar ~/

b. Find the ‘guava’ JAR file as below

find . -name "guava*"

Move guava-10.0.1.jar out of ‘<Flume installation directory>/lib’.

sudo mv guava-10.0.1.jar ~/

c. Download guava-17.0.jar from http://mvnrepository.com/artifact/com.google.guava/guava/17.0


Now, copy this downloaded JAR file to ‘<Flume installation directory>/lib’
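
Steps a to c can also be scripted. The sketch below rehearses the moves on temporary directories filled with empty placeholder files, so it is safe to run anywhere; lib_dir and backup_dir are hypothetical names, and the tutorial itself moves the old JARs to the home directory.

```shell
lib_dir=$(mktemp -d)      # hypothetical stand-in for the Flume lib directory
backup_dir=$(mktemp -d)   # hypothetical backup location (the tutorial uses ~/)
touch "$lib_dir/protobuf-java-2.4.1.jar" "$lib_dir/guava-10.0.1.jar" "$lib_dir/flume-ng-core-1.4.0.jar"

# Move each incompatible JAR out of lib if it is present.
for old_jar in protobuf-java-2.4.1.jar guava-10.0.1.jar; do
  if [ -f "$lib_dir/$old_jar" ]; then
    mv "$lib_dir/$old_jar" "$backup_dir/"
    echo "moved $old_jar to $backup_dir"
  fi
done

# The newer guava JAR would be copied in here once downloaded, for example:
#   cp guava-17.0.jar "$lib_dir/"

# Confirm that no old guava or protobuf JAR is left in lib.
remaining=$(find "$lib_dir" -name 'guava*' -o -name 'protobuf-java*')
echo "old JARs still in lib: ${remaining:-none}"
```

Keeping the old JARs in a backup location instead of deleting them makes it easy to restore the original lib directory later.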

Step 4) Go to ‘<Flume installation directory>/bin’ and start Flume as follows:

./flume-ng agent -n MyTwitAgent -c conf -f <Flume installation directory>/conf/flume.conf
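
For reference, the same start command can be written with the long option names and absolute paths, which avoids depending on the current directory. The sketch below only assembles and prints the command; FLUME_HOME and AGENT_NAME are hypothetical variable names, and the installation path is the one assumed earlier in this tutorial.

```shell
FLUME_HOME=/usr/local/apache-flume-1.4.0-bin   # assumed installation directory
AGENT_NAME=MyTwitAgent                         # must match the prefix used in flume.conf

# --name, --conf and --conf-file are the long forms of -n, -c and -f;
# -Dflume.root.logger=INFO,console sends the log output to the terminal.
flume_cmd="$FLUME_HOME/bin/flume-ng agent --name $AGENT_NAME --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/flume.conf -Dflume.root.logger=INFO,console"

# Dry run: print the command so it can be reviewed before it is executed.
echo "$flume_cmd"
```

The agent name passed on the command line has to match the prefix of the properties in flume.conf; otherwise Flume finds no configuration for the agent.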


[Figure: Command prompt window where Flume is fetching tweets]

From the messages in the command window, we can see that the output is written to the /user/hduser/flume/tweets/ directory.

Now, open this directory using a web browser.

Step 5) To see the result of the data load, open http://localhost:50070/ in a browser, browse the file system, and go to the directory where the data has been loaded, that is:

<HDFS home directory>/flume/tweets/

10 Pig

10 Pig Introduction