## Sunday, October 20, 2013

We will set up a single-machine Hadoop cluster in a few simple steps. This is an experimental system meant to get you started with Hadoop quickly. Once we are familiar with the core components, we can easily add more nodes to the system. There are a few differences from the official quickstart guide.

Let's assume you have downloaded Hadoop 2.2.0 and extracted the contents at /home/hduser/hadoop (the same path we export as HADOOP_HOME below).

Now let us generate an SSH key so that we can connect to the machine over SSH without typing a password.

```
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
```


This should do the trick. Make sure you can connect to localhost without typing a password:

```
$ ssh localhost
```

You should be connected to the local machine. Exit the session before moving on.

Install JDK 7 on your machine. I am assuming the JDK is installed at /usr/lib/jvm/java-7-openjdk-amd64. Let's edit .bashrc (or your shell's init script if you use something other than bash) and add the following lines at the end:

```
export HADOOP_HOME=/home/hduser/hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export PATH=$PATH:$HADOOP_HOME/bin
```

Now edit the file $HADOOP_HOME/etc/hadoop/hadoop-env.sh and add the following line after the `export JSVC_HOME=${JSVC_HOME}` line:

```
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
```

Now we need to edit three configuration files in Hadoop. In each file we'll be adding a few property blocks inside the `<configuration> ... </configuration>` block.

$HADOOP_HOME/etc/hadoop/core-site.xml:

```xml
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hduser/tmp</value>
  <description>TMP Directory</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
  <description></description>
</property>
```


$HADOOP_HOME/etc/hadoop/mapred-site.xml:

```xml
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
  <description>MapReduce job tracker</description>
</property>
```

$HADOOP_HOME/etc/hadoop/hdfs-site.xml:

```xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description></description>
</property>
```
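Before starting any daemons, it's worth checking that the edited files are still well-formed XML, since a malformed tag tends to cause cryptic startup failures. A small sketch, assuming python3 is on the PATH (xmllint would work equally well):

```shell
# Parse each edited config file; a well-formed file prints an OK line,
# a broken one makes the parser exit non-zero with an error message.
for f in "$HADOOP_HOME"/etc/hadoop/core-site.xml \
         "$HADOOP_HOME"/etc/hadoop/mapred-site.xml \
         "$HADOOP_HOME"/etc/hadoop/hdfs-site.xml; do
    python3 -c "import sys, xml.etree.ElementTree as ET; ET.parse(sys.argv[1])" "$f" \
        && echo "OK: $f"
done
```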


After editing the files, we are now ready to format the filesystem:

```
$ hdfs namenode -format
```

It should print some information, and among the output you should see a line like:

```
13/10/20 11:02:21 INFO common.Storage: Storage directory /home/hduser/tmp/dfs/name has been successfully formatted.
```

Now the HDFS filesystem is ready. Let's create a directory to store our data files:

```
$ hadoop fs -mkdir /user
$ hadoop fs -mkdir /user/hduser
$ hadoop fs -mkdir /user/hduser/data
```

Now we need some sample data files, each containing a lot of words. You may use a free book that is in text format. I have created three sample text files for this and put them in the /home/hduser/data/ directory on my machine. We can copy these files to HDFS using the following command:

```
# Usage: hadoop fs -copyFromLocal <localsrc> URI
$ bin/hdfs dfs -copyFromLocal /home/hduser/data/* /user/hduser/data
```

The copyFromLocal operation takes two paths: the first is the local source path and the second is the destination HDFS URI. We can see the files using the -ls command:

```
hduser@hadoop:~/hadoop$ hdfs dfs -ls /user/hduser/data
```

```
-rw-r--r--   1 hduser supergroup     538900 2013-10-20 11:08 /user/hduser/data/file1.txt
-rw-r--r--   1 hduser supergroup     856710 2013-10-20 11:08 /user/hduser/data/file2.txt
-rw-r--r--   1 hduser supergroup     357800 2013-10-20 11:08 /user/hduser/data/file3.txt
```
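If you don't have free e-books handy, you can fabricate sample files like the ones listed above. A quick sketch (the words and line count are arbitrary placeholders, and the path matches the walkthrough, so adjust as needed):

```shell
# Generate three small text files full of words for the word-count demo.
mkdir -p /home/hduser/data
for i in 1 2 3; do
    # printf repeats its format string once per argument;
    # seq supplies 5000 arguments, giving 5000 lines per file
    printf 'hadoop hdfs mapreduce yarn sample line %s\n' $(seq 5000) \
        > /home/hduser/data/file$i.txt
done
wc -w /home/hduser/data/*.txt
```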


Now let's start the Hadoop cluster:

```
hduser@hadoop:~/hadoop$ sbin/start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
localhost: 09:36:31 up 14:05, 2 users, load average: 0.02, 0.04, 0.08
[output removed]
```

We are now ready to run the MapReduce sample that ships with Hadoop. Let's copy the jar file to our current directory:

```
$ cp /home/hduser/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar ./
```

And run it by providing the MapReduce class name, the data location, and the output location:

```
$ bin/hadoop jar hadoop-mapreduce-examples-2.2.0.jar wordcount /user/hduser/data /user/hduser/wcoutput
```

```
13/10/20 11:25:36 INFO input.FileInputFormat: Total input paths to process : 3
13/10/20 11:25:36 INFO mapreduce.JobSubmitter: number of splits:3
13/10/20 11:25:37 INFO mapreduce.Job: Running job: job_local3325941_0001
[output removed]
13/10/20 11:25:43 INFO mapreduce.Job: Job job_local3325941_0001 completed successfully
```

OK, the job has completed successfully. We should be able to see the output files in the output directory:

```
hduser@hadoop:~/hadoop$ hdfs dfs -ls /user/hduser/wcoutput/
```


```
-rw-r--r--   1 hduser supergroup          0 2013-10-20 11:25 /user/hduser/wcoutput/_SUCCESS
-rw-r--r--   1 hduser supergroup     880838 2013-10-20 11:25 /user/hduser/wcoutput/part-r-00000
```

```
$ bin/hdfs dfs -copyToLocal /user/hduser/wcoutput/part-r-00000 ./wc_result.txt
```

And view it if you want:

```
$ vim wc_result.txt
```
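Rather than paging through the whole file, you can pull out the most frequent words. The wordcount example writes one `word<TAB>count` pair per line, so a numeric sort on the second field does the job (a sketch, assuming the wc_result.txt copied in the previous step):

```shell
# Show the ten most frequent words: sort descending on the numeric,
# tab-separated count column, then keep only the first ten lines.
sort -t "$(printf '\t')" -k2,2nr wc_result.txt | head -n 10
```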