Apache Flume Collecting Twitter Data

posted on Nov 20th, 2016

Apache Flume

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic applications.

Prerequisites

1) A machine with Ubuntu 14.04 LTS operating system

2) Apache Hadoop pre-installed (How to install Hadoop on Ubuntu 14.04)

3) Apache Flume 1.6.0 pre-installed (How to install Flume on Ubuntu 14.04)

Apache Flume Collecting Twitter Data

We will create a Twitter application and fetch tweets from it using the experimental Twitter source provided by Apache Flume. A memory channel will buffer these tweets, and an HDFS sink will push them into HDFS.

Step 1 - Create an application in Twitter with your Twitter account. Browse to the Twitter URL below to create the application.

https://apps.twitter.com/

a) Sign in to your Twitter account. You will see the Twitter Application Management window, where you can create, delete, and manage Twitter Apps.

b) Click on the Create New App button. You will be redirected to a window with an application form in which you fill in your details to create the App. When filling in the website address, give the complete URL pattern, for example, http://example.com.

c) Fill in the details, accept the Developer Agreement, and click on the Create your Twitter application button at the bottom of the page. If everything goes fine, the App will be created.

d) Under the Keys and Access Tokens tab, towards the bottom of the page, you will find a button named Create my access token. Click on it to generate the access token.

e) Finally, click on the Test OAuth button at the top right of the page. This leads to a page that displays your Consumer key, Consumer secret, Access token, and Access token secret. Copy these details; they are needed to configure the Flume agent.

Step 2 - Change the directory to /usr/local/hadoop/sbin

$ cd /usr/local/hadoop/sbin

Step 3 - Start all Hadoop daemons.

$ start-all.sh
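Note: on recent Hadoop releases start-all.sh is deprecated (it prints a warning). If you prefer, you can start HDFS and YARN separately with the equivalent commands:

$ start-dfs.sh
$ start-yarn.sh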

Step 4 - Verify that the Hadoop daemons are running with jps. The JPS (Java Virtual Machine Process Status) tool reports information only on the JVMs for which it has access permissions.

$ jps
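If all the daemons came up, the output should list roughly the following processes (the process IDs here are placeholders and will differ on your machine):

2345 NameNode
2467 DataNode
2623 SecondaryNameNode
2789 ResourceManager
2911 NodeManager
3050 Jps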

Step 5 - Create a /user/hduser/twitter_data folder in HDFS.

$ hdfs dfs -mkdir hdfs://localhost:9000/user/hduser/twitter_data
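To confirm that the folder was created, list the parent directory:

$ hdfs dfs -ls hdfs://localhost:9000/user/hduser/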

Step 6 - Copy the following Twitter4J jar files into the /usr/local/flume/lib/ folder (see the copy command after the list). You can download these jar files from the internet.

twitter4j-async-4.0.4.jar
twitter4j-core-4.0.4.jar
twitter4j-media-support-4.0.4.jar
twitter4j-stream-4.0.4.jar
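Assuming the jars were downloaded to your Downloads folder, a single copy command such as the following puts them in place (adjust the source path to wherever you saved them):

$ cp ~/Downloads/twitter4j-*-4.0.4.jar /usr/local/flume/lib/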

Step 7 - Edit the flume-env.sh file in the Flume conf directory.

$ gedit /usr/local/flume/conf/flume-env.sh

Step 8 - Add the Flume library path to the CLASSPATH in the flume-env.sh file. Save and close.

export CLASSPATH=$CLASSPATH:$FLUME_HOME/lib/*
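For reference, a minimal flume-env.sh after this change might look like the following; the JAVA_HOME path is only an example for OpenJDK 7 on Ubuntu 14.04, so adjust it to your own JDK location:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export CLASSPATH=$CLASSPATH:$FLUME_HOME/lib/*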

Step 9 - Configuration File

Given below is an example of the configuration file. Copy this content and save it as twitter.conf in the conf folder of Flume.

Don't forget to replace consumerKey, consumerSecret, accessToken, and accessTokenSecret with your own Twitter OAuth credentials.

twitter.conf

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Describing/Configuring the source
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = bVd3fwceBGCvjghPqjVF6A2jW
TwitterAgent.sources.Twitter.consumerSecret = 86EPCj7ByjPpPTx4vNN1nTYqOsdjN0v7ZsainjEgjGY6KzwjFV
TwitterAgent.sources.Twitter.accessToken = ******************-0NpAbHQt1WW2NM5njFieh6xVA0BwedG
TwitterAgent.sources.Twitter.accessTokenSecret = lUcbFDxu08lRE6uIISHE9fgAsEdZXKCh6MTpJqbplYUXy

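# Describing/Configuring the sink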
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/hduser/twitter_data/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 5
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10

# Describing/Configuring the channel

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

# Binding the source and sink to the channel
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel

$ cp /home/hduser/Desktop/FLUME/twitter.conf /usr/local/flume/conf/

Step 10 - Change the directory to /usr/local/flume

$ cd $FLUME_HOME

Step 11 - Execution

$ bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
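Let the agent run for a few minutes, then verify from another terminal that tweets are arriving in HDFS. The HDFS sink writes files with the default FlumeData prefix, so the listing should show files similar to the following (the exact names will differ):

$ hdfs dfs -ls hdfs://localhost:9000/user/hduser/twitter_data/
$ hdfs dfs -cat hdfs://localhost:9000/user/hduser/twitter_data/FlumeData.*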

