Assuming that you have downloaded the latest Hadoop binary from the Apache Hadoop site, let's start configuring it.
Let's assume you have extracted the contents at
/home/hduser/hadoop
Now let us generate an SSH key so that we can connect to the machine over SSH without typing a password.
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
This should do the trick. Make sure you can connect to localhost without typing a password:
$ ssh localhost
You should be connected to the local machine. Exit the session before moving on to the next steps.
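If you want a quick non-interactive check, the following should print "ok" without prompting (BatchMode makes ssh fail instead of asking for a password, so any prompt means the key setup didn't work):
$ ssh -o BatchMode=yes localhost echo ok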
Install JDK 7 on your machine. I am assuming the JDK is installed at /usr/lib/jvm/java-7-openjdk-amd64.
Let's edit .bashrc (or your shell's init script if you use something other than bash) and add the following lines at the end.
export HADOOP_HOME=/home/hduser/hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export PATH=$PATH:$HADOOP_HOME/bin

Now edit the file $HADOOP_HOME/etc/hadoop/hadoop-env.sh and add the following line after the
export JSVC_HOME=${JSVC_HOME}
line:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
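To make sure the new variables are picked up, you can reload your shell init script and verify the paths (just a quick sanity check):
$ source ~/.bashrc
$ echo $JAVA_HOME
$ $HADOOP_HOME/bin/hadoop version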
Now we need to edit 3 configuration files in Hadoop. In each file we'll be adding a few property blocks inside the <configuration> ... </configuration> block.
$HADOOP_HOME/etc/hadoop/core-site.xml
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hduser/tmp</value>
  <description>TMP Directory</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
  <description></description>
</property>
$HADOOP_HOME/etc/hadoop/mapred-site.xml
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
  <description>MapReduce job tracker</description>
</property>
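Note: depending on how your Hadoop tarball is packaged, mapred-site.xml may not exist yet and only ship as mapred-site.xml.template. If that is the case on your machine, create it first and then add the property block above:
$ cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml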
$HADOOP_HOME/etc/hadoop/hdfs-site.xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description></description>
</property>
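Once the three files are saved, you can optionally check that Hadoop actually picks up the values. If your Hadoop version supports hdfs getconf, it prints the configured value for a given key:
$ hdfs getconf -confKey fs.default.name
$ hdfs getconf -confKey dfs.replication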
After editing the files we are now ready to format the filesystem:
$ hdfs namenode -format
It should print some information, and among the output you should see a line like:
13/10/20 11:02:21 INFO common.Storage: Storage directory /home/hduser/tmp/dfs/name has been successfully formatted.
Now the HDFS filesystem is ready. Let's create a directory to store our data files.
hadoop fs -mkdir /user
hadoop fs -mkdir /user/hduser
hadoop fs -mkdir /user/hduser/data
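If your Hadoop version supports the -p flag, the same directory tree can be created in a single command instead:
hadoop fs -mkdir -p /user/hduser/data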
Now we need sample data files that contain a lot of words. You may use any free book that is available in text format. I have created three sample text files for this and put them in the /home/hduser/data/ directory on my machine.
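If you don't have suitable text files handy, you can generate throwaway samples with a small shell loop (the paths and line counts here are just examples to match the layout assumed above):
$ mkdir -p /home/hduser/data
$ for i in 1 2 3; do yes "hadoop wordcount sample file number $i" | head -n 20000 > /home/hduser/data/file$i.txt; done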
We can copy these files to HDFS using the following command:
#Usage: hadoop fs -copyFromLocal <localsrc> URI
bin/hdfs dfs -copyFromLocal /home/hduser/data/* /user/hduser/data
The copyFromLocal operation requires two paths: the first is the local file path and the second is the HDFS URI. We can see the files using the -ls command:
hduser@hadoop:~/hadoop$ hdfs dfs -ls /user/hduser/data
-rw-r--r--   1 hduser supergroup     538900 2013-10-20 11:08 /user/hduser/data/file1.txt
-rw-r--r--   1 hduser supergroup     856710 2013-10-20 11:08 /user/hduser/data/file2.txt
-rw-r--r--   1 hduser supergroup     357800 2013-10-20 11:08 /user/hduser/data/file3.txt
Now let's start the Hadoop cluster:
hduser@hadoop:~/hadoop$ sbin/start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
localhost: 09:36:31 up 14:05, 2 users, load average: 0.02, 0.04, 0.08
[output removed]
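After the scripts finish you can confirm the daemons are actually up with jps (it ships with the JDK). On a working single-node setup you should see processes such as NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager listed:
$ jps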
We are now ready to run the MapReduce sample that ships with Hadoop. Let's copy the jar file to our current directory.
cp /home/hduser/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar ./
And run it by providing the MapReduce class name, the data location and the output location:
bin/hadoop jar hadoop-mapreduce-examples-2.2.0.jar wordcount /user/hduser/data /user/hduser/wcoutput
13/10/20 11:25:36 INFO input.FileInputFormat: Total input paths to process : 3
13/10/20 11:25:36 INFO mapreduce.JobSubmitter: number of splits:3
13/10/20 11:25:37 INFO mapreduce.Job: Running job: job_local3325941_0001
[output removed]
13/10/20 11:25:43 INFO mapreduce.Job: Job job_local3325941_0001 completed successfully
OK, the job has completed successfully. We should be able to see the output files in the output directory:
hduser@hadoop:~/hadoop$ hdfs dfs -ls /user/hduser/wcoutput/
-rw-r--r-- 1 hduser supergroup 0 2013-10-20 11:25 /user/hduser/wcoutput/_SUCCESS
-rw-r--r-- 1 hduser supergroup 880838 2013-10-20 11:25 /user/hduser/wcoutput/part-r-00000
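You can also peek at the result directly in HDFS without copying it anywhere:
$ hdfs dfs -cat /user/hduser/wcoutput/part-r-00000 | head -n 20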
Now we can download the output file to our local disk:
#Usage: hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
$ bin/hdfs dfs -copyToLocal /user/hduser/wcoutput/part-r-00000 ./wc_result.txt
And view it if you want:
$ vim wc_result.txt
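Each line of the output is a word followed by its count, so sorting on the second column shows the most frequent words (plain coreutils, nothing Hadoop specific):
$ sort -k2,2nr wc_result.txt | head -n 10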
Finally, we stop the cluster:
$ sbin/stop-all.sh