Sunday, October 20, 2013

Hadoop quickstart: Run your first MapReduce job in 10 minutes

We set up a single-machine cluster in a few simple steps. This is an experimental setup for getting started with Hadoop quickly. Once we are familiar with the core components, we can easily add more nodes to the system. There are a few differences from the official quickstart guide.

Assuming that you have downloaded the latest Hadoop binary from the Apache Hadoop site, let's start configuring it.

Let's assume you have extracted the contents to

/home/hduser/hadoop

Now let's generate an SSH key so that we can connect to the machine over SSH without typing a password.

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

This should do the trick. Make sure you can connect to localhost without typing a password:

$ ssh localhost


You should be connected to the local machine. Exit the session before moving on to the next steps.
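
If ssh prompted for a password, the permissions on the key files may be too open; tightening them usually fixes it:

$ chmod 0600 ~/.ssh/authorized_keys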

Install JDK 7 on your machine. I am assuming the JDK is installed at /usr/lib/jvm/java-7-openjdk-amd64.

Let's edit .bashrc (or your shell's init script if you use something other than bash) and add the following lines at the end.

export HADOOP_HOME=/home/hduser/hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export PATH=$PATH:$HADOOP_HOME/bin

Now edit the file $HADOOP_HOME/etc/hadoop/hadoop-env.sh and add the following line after the export JSVC_HOME=${JSVC_HOME} line:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
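
To check that the environment is picked up, you can open a new shell (or source ~/.bashrc) and ask Hadoop for its version; it should print the release you extracted:

$ source ~/.bashrc
$ hadoop version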

Now we need to edit three configuration files in Hadoop. In each file we'll be adding a few property blocks inside the <configuration> ... </configuration> block.

$HADOOP_HOME/etc/hadoop/core-site.xml

<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hduser/tmp</value>
  <description>TMP Directory</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
  <description>The default filesystem URI</description>
</property>
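
hadoop.tmp.dir points to /home/hduser/tmp; Hadoop will normally create it when you format the filesystem, but it doesn't hurt to create it up front and make sure hduser owns it:

$ mkdir -p /home/hduser/tmp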

$HADOOP_HOME/etc/hadoop/mapred-site.xml

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
  <description>MapReduce job tracker</description>
</property>

$HADOOP_HOME/etc/hadoop/hdfs-site.xml

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication</description>
</property>

After editing the files, we are ready to format the filesystem:

$ hdfs namenode -format

It should print a fair amount of information, and among those lines you should see one like:

13/10/20 11:02:21 INFO common.Storage: Storage directory /home/hduser/tmp/dfs/name has been successfully formatted.

Now the HDFS filesystem is ready. Let's create a directory to store our data files.

hadoop fs -mkdir /user
hadoop fs -mkdir /user/hduser
hadoop fs -mkdir /user/hduser/data

Now we need sample data files that contain a lot of words. You may use any free book in plain-text format. I have created three sample text files for this and put them in the /home/hduser/data/ directory on my machine.
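
If you don't have suitable text files handy, one quick way to generate some is to repeat the system dictionary (its location may differ on your distribution) into three files:

mkdir -p /home/hduser/data
for i in 1 2 3; do
  cat /usr/share/dict/words /usr/share/dict/words > /home/hduser/data/file$i.txt
done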

We can copy these files to HDFS using the following command:

#Usage: hadoop fs -copyFromLocal <localsrc> URI
bin/hdfs dfs -copyFromLocal /home/hduser/data/* /user/hduser/data

The copyFromLocal operation requires two paths: the first is the local file path and the second is the HDFS URI. We can see the files using the -ls command:

hduser@hadoop:~/hadoop$ hdfs dfs -ls /user/hduser/data


-rw-r--r--   1 hduser supergroup     538900 2013-10-20 11:08 /user/hduser/data/file1.txt
-rw-r--r--   1 hduser supergroup     856710 2013-10-20 11:08 /user/hduser/data/file2.txt
-rw-r--r--   1 hduser supergroup     357800 2013-10-20 11:08 /user/hduser/data/file3.txt

Now let's start the Hadoop cluster (if the hadoop fs commands above failed because the NameNode was not running, start the cluster first and then repeat them):

hduser@hadoop:~/hadoop$ sbin/start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
localhost:  09:36:31 up 14:05,  2 users,  load average: 0.02, 0.04, 0.08
[output removed]
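
You can check that the daemons came up with jps. The exact list depends on your configuration, but you should see processes such as NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager:

$ jps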

We are now ready to run the MapReduce sample that ships with Hadoop. Let's copy the jar file to our current directory.

cp /home/hduser/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar ./

And run it, providing the MapReduce program name, the data location, and the output location:

bin/hadoop jar hadoop-mapreduce-examples-2.2.0.jar wordcount /user/hduser/data /user/hduser/wcoutput

13/10/20 11:25:36 INFO input.FileInputFormat: Total input paths to process : 3
13/10/20 11:25:36 INFO mapreduce.JobSubmitter: number of splits:3
13/10/20 11:25:37 INFO mapreduce.Job: Running job: job_local3325941_0001
[output removed]
13/10/20 11:25:43 INFO mapreduce.Job: Job job_local3325941_0001 completed successfully

OK, the job completed successfully. We should be able to see the output files in the output directory.

hduser@hadoop:~/hadoop$ hdfs dfs -ls /user/hduser/wcoutput/

-rw-r--r--   1 hduser supergroup          0 2013-10-20 11:25 /user/hduser/wcoutput/_SUCCESS
-rw-r--r--   1 hduser supergroup     880838 2013-10-20 11:25 /user/hduser/wcoutput/part-r-00000
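
You can also peek at the result directly in HDFS without downloading it:

$ bin/hdfs dfs -cat /user/hduser/wcoutput/part-r-00000 | head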

Now we can download the output file to our local disk:

#Usage: hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
$ bin/hdfs dfs -copyToLocal /user/hduser/wcoutput/part-r-00000 ./wc_result.txt

And view it if you want:

$ vim wc_result.txt
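
Each line of the output is a word and its count separated by a tab, so you can, for example, list the most frequent words:

$ sort -k2 -nr wc_result.txt | head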

Finally, we stop the cluster:

$ sbin/stop-all.sh