Hadoop Installation Tutorial (Hadoop 2.x)
Hadoop 2 (YARN) is the new version of Hadoop. It adds the YARN resource manager in addition to the HDFS and MapReduce components. Hadoop MapReduce is a programming model and software framework for writing applications; it is an open-source variant of MapReduce, which was designed and implemented by Google initially for processing and generating large data sets. HDFS is Hadoop’s underlying data persistence layer, loosely modeled after the Google File System (GFS). Many cloud computing services, such as Amazon EC2, provide MapReduce functions. Although MapReduce has its limitations, it is an important framework for processing large data sets.
This tutorial shows how to set up a Hadoop 2.x (YARN) environment in a cluster. We set up a Hadoop (YARN) cluster in which one node runs the NameNode and the ResourceManager, and the other nodes run the NodeManager and DataNode (slaves). If you are not familiar with these names, please take a look at the YARN architecture first.
First, we assume that you have created a Linux user “hadoop” on each node that you will use for running Hadoop, and that the “hadoop” user’s home directory is /home/hadoop/.
Configure hostnames
Hadoop uses hostnames to identify nodes by default. So you should first give each host a name and allow the other nodes to resolve the hosts’ names to IP addresses. The simplest way is to add the hostname-to-IP mappings to every node’s /etc/hosts file. For a larger cluster, you may use a DNS service. Here, we use 3 nodes as the example and add these lines to the /etc/hosts file on each node:
10.0.3.29 hofstadter
10.0.3.30 snell
10.0.3.31 biot
Enable “hadoop” user to password-less SSH login to slaves
For convenience, make sure the “hadoop” user on the NameNode and ResourceManager node can ssh to the slaves without a password, so that we do not need to enter the password every time.
Details about password-less SSH login can be found in Enabling Password-less ssh Login.
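As a quick sketch (assuming OpenSSH and the hostnames used above), you can generate a key pair for the “hadoop” user on the NameNode/ResourceManager node and copy the public key to every slave:
ssh-keygen -t rsa              # accept the defaults and leave the passphrase empty
ssh-copy-id hadoop@hofstadter
ssh-copy-id hadoop@snell
ssh-copy-id hadoop@biot
If ssh-copy-id is not available, append ~/.ssh/id_rsa.pub to ~/.ssh/authorized_keys on each slave manually.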
Install software needed by Hadoop
Besides Hadoop itself, the software needed to run Hadoop is Java (we use the JDK here).
Java JDK
The Oracle Java JDK can be downloaded from the JDK’s webpage. You need to install (actually, just copy the JDK directory) the Java JDK on all nodes of the Hadoop cluster.
As an example in this tutorial, the JDK is installed into
/usr/java/default/
You may need to make a soft link to /usr/java/default from the actual location where you installed the JDK.
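For example, if the JDK was unpacked to /usr/java/jdk1.7.0_67 (a hypothetical path; substitute the actual directory on your nodes), the link can be created as follows:
ln -s /usr/java/jdk1.7.0_67 /usr/java/default    # run as root; the JDK path here is only an example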
Add these 2 lines to the “hadoop” user’s ~/.bashrc on all nodes:
export JAVA_HOME=/usr/java/default
export PATH=$JAVA_HOME/bin:$PATH
Hadoop
Hadoop software can be downloaded from the Hadoop website. In this tutorial, we use Hadoop 2.5.0.
You can unpack the tar ball to a directory. In this example, we unpack it to
/home/hadoop/hadoop/
which is a directory under the hadoop Linux user’s home directory.
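For example, assuming the downloaded tarball is hadoop-2.5.0.tar.gz in the “hadoop” user’s home directory (the file name depends on the release you downloaded):
cd ~
tar xzf hadoop-2.5.0.tar.gz    # unpacks into ./hadoop-2.5.0
mv hadoop-2.5.0 hadoop         # so that Hadoop lives in /home/hadoop/hadoop/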
The Hadoop directory needs to be duplicated to all nodes after configuration; remember to do this after finishing the configuration below.
Configure environment variables for the “hadoop” user
We assume the “hadoop” user uses bash as its shell.
Add these lines at the bottom of ~/.bashrc on all nodes:
export HADOOP_COMMON_HOME=$HOME/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_COMMON_HOME
export HADOOP_HDFS_HOME=$HADOOP_COMMON_HOME
export YARN_HOME=$HADOOP_COMMON_HOME
export PATH=$PATH:$HADOOP_COMMON_HOME/bin
export PATH=$PATH:$HADOOP_COMMON_HOME/sbin
The last 2 lines add Hadoop’s bin and sbin directories to the PATH so that we can run Hadoop’s commands directly without specifying their full paths.
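To make these variables take effect in the current shell without logging out and back in, reload the file:
source ~/.bashrc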
Configure Hadoop
The configuration files for Hadoop are under $HADOOP_COMMON_HOME/etc/hadoop for our installation. The content shown below is added to each .xml file between <configuration> and </configuration>.
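For reference, each of these .xml files is a small XML document; the shipped files already contain an empty <configuration> element, and a minimal skeleton looks roughly like this (the <property> elements from the following sections go inside it):
<?xml version="1.0"?>
<configuration>
<!-- <property> elements go here -->
</configuration>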
core-site.xml
Here the NameNode runs on biot.
<property>
<name>fs.defaultFS</name>
<value>hdfs://biot/</value>
<description>NameNode URI</description>
</property>
yarn-site.xml
The YARN ResourceManager runs on biot and supports MapReduce shuffle.
<property>
<name>yarn.resourcemanager.hostname</name>
<value>biot</value>
<description>The hostname of the ResourceManager</description>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>shuffle service for MapReduce</description>
</property>
hdfs-site.xml
The configuration here is optional. Add the following settings if you need them. The descriptions contain the purpose of each configuration.
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hadoop/hdfs/</value>
<description>DataNode directory for storing data chunks.</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/hdfs/</value>
<description>NameNode directory for namespace and transaction logs storage.</description>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Number of replicas for each block.</description>
</property>
mapred-site.xml
First copy mapred-site.xml.template to mapred-site.xml.
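A sketch of the copy step, assuming you run it from the configuration directory:
cd $HADOOP_COMMON_HOME/etc/hadoop
cp mapred-site.xml.template mapred-site.xml
Then add the following content between <configuration> and </configuration>: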
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>Execution framework.</description>
</property>
slaves
Delete localhost and add the hostnames of all the slave nodes (the nodes running the DataNode and NodeManager), one per line. For example:
hofstadter
snell
biot
Duplicate Hadoop configuration files to all nodes
We now duplicate the Hadoop directory, including the configuration files under etc/hadoop, to all nodes. You may use this script to duplicate the hadoop directory:
cd
for i in `cat hadoop/etc/hadoop/slaves`; do
echo $i; rsync -avxP --exclude=logs hadoop/ $i:hadoop/;
done
By now, we have finished copying the Hadoop software and configuring Hadoop. Now let’s have some fun with Hadoop.
Format a new distributed filesystem
Format a new distributed file system by running the following command on the NameNode:
hdfs namenode -format
Start HDFS/YARN
Manually
Start the HDFS with the following command, run on the designated NameNode:
hadoop-daemon.sh --script hdfs start namenode
Run the script to start DataNodes on each slave:
hadoop-daemon.sh --script hdfs start datanode
Start the YARN with the following command, run on the designated ResourceManager:
yarn-daemon.sh start resourcemanager
Run a script to start NodeManagers on all slaves:
yarn-daemon.sh start nodemanager
By scripts
start-dfs.sh
start-yarn.sh
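To quickly verify that the daemons came up on a node, you can list the running Java processes with jps (shipped with the JDK); the NameNode/ResourceManager host should show NameNode and ResourceManager, and each slave should show DataNode and NodeManager:
jps    # lists running JVM processes such as NameNode, DataNode, ResourceManager, NodeManager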
Check status
Check HDFS status
hdfs dfsadmin -report
It should show a report like:
Configured Capacity: 158550355968 (147.66 GB)
Present Capacity: 11206017024 (10.44 GB)
DFS Remaining: 11205943296 (10.44 GB)
DFS Used: 73728 (72 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Live datanodes (3):
Name: 10.0.3.30:50010 (snell)
Hostname: snell
Decommission Status : Normal
Configured Capacity: 52850118656 (49.22 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 49800732672 (46.38 GB)
DFS Remaining: 3049361408 (2.84 GB)
DFS Used%: 0.00%
DFS Remaining%: 5.77%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Fri Sep 05 07:50:17 GMT 2014
Name: 10.0.3.31:50010 (biot)
Hostname: biot
Decommission Status : Normal
Configured Capacity: 52850118656 (49.22 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 49513574400 (46.11 GB)
DFS Remaining: 3336519680 (3.11 GB)
DFS Used%: 0.00%
DFS Remaining%: 6.31%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Fri Sep 05 07:50:17 GMT 2014
Name: 10.0.3.29:50010 (hofstadter)
Hostname: hofstadter
Decommission Status : Normal
Configured Capacity: 52850118656 (49.22 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 48030068736 (44.73 GB)
DFS Remaining: 4820025344 (4.49 GB)
DFS Used%: 0.00%
DFS Remaining%: 9.12%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Fri Sep 05 07:50:17 GMT 2014
Check YARN status
yarn node -list
It should show a report like:
Total Nodes:3
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
hofstadter:43469 RUNNING hofstadter:8042 0
snell:57039 RUNNING snell:8042 0
biot:52834 RUNNING biot:8042 0
Run basic tests
Let’s play with grep as the basic test of Hadoop YARN/HDFS/MapReduce.
First, create the directory to store data for the hadoop user.
hadoop fs -mkdir /user
hadoop fs -mkdir /user/hadoop
Then, put the configuration file directory as the input.
hadoop fs -put /home/hadoop/hadoop/etc/hadoop /user/hadoop/hadoop-config
or simply refer to the directory relative to the hadoop user’s HDFS home directory (check the discussion in the comments below; thanks to Thirumal Venkat for this tip):
hadoop fs -put /home/hadoop/hadoop/etc/hadoop hadoop-config
Let’s ls it to check the content:
hdfs dfs -ls /user/hadoop/hadoop-config
It should print out results like the following.
Found 26 items
-rw-r--r-- 3 hadoop supergroup 3589 2014-09-05 07:53 /user/hadoop/hadoop-config/capacity-scheduler.xml
-rw-r--r-- 3 hadoop supergroup 1335 2014-09-05 07:53 /user/hadoop/hadoop-config/configuration.xsl
-rw-r--r-- 3 hadoop supergroup 318 2014-09-05 07:53 /user/hadoop/hadoop-config/container-executor.cfg
-rw-r--r-- 3 hadoop supergroup 917 2014-09-05 07:53 /user/hadoop/hadoop-config/core-site.xml
-rw-r--r-- 3 hadoop supergroup 3589 2014-09-05 07:53 /user/hadoop/hadoop-config/hadoop-env.cmd
-rw-r--r-- 3 hadoop supergroup 3443 2014-09-05 07:53 /user/hadoop/hadoop-config/hadoop-env.sh
-rw-r--r-- 3 hadoop supergroup 2490 2014-09-05 07:53 /user/hadoop/hadoop-config/hadoop-metrics.properties
-rw-r--r-- 3 hadoop supergroup 1774 2014-09-05 07:53 /user/hadoop/hadoop-config/hadoop-metrics2.properties
-rw-r--r-- 3 hadoop supergroup 9201 2014-09-05 07:53 /user/hadoop/hadoop-config/hadoop-policy.xml
-rw-r--r-- 3 hadoop supergroup 775 2014-09-05 07:53 /user/hadoop/hadoop-config/hdfs-site.xml
-rw-r--r-- 3 hadoop supergroup 1449 2014-09-05 07:53 /user/hadoop/hadoop-config/httpfs-env.sh
-rw-r--r-- 3 hadoop supergroup 1657 2014-09-05 07:53 /user/hadoop/hadoop-config/httpfs-log4j.properties
-rw-r--r-- 3 hadoop supergroup 21 2014-09-05 07:53 /user/hadoop/hadoop-config/httpfs-signature.secret
-rw-r--r-- 3 hadoop supergroup 620 2014-09-05 07:53 /user/hadoop/hadoop-config/httpfs-site.xml
-rw-r--r-- 3 hadoop supergroup 11118 2014-09-05 07:53 /user/hadoop/hadoop-config/log4j.properties
-rw-r--r-- 3 hadoop supergroup 918 2014-09-05 07:53 /user/hadoop/hadoop-config/mapred-env.cmd
-rw-r--r-- 3 hadoop supergroup 1383 2014-09-05 07:53 /user/hadoop/hadoop-config/mapred-env.sh
-rw-r--r-- 3 hadoop supergroup 4113 2014-09-05 07:53 /user/hadoop/hadoop-config/mapred-queues.xml.template
-rw-r--r-- 3 hadoop supergroup 887 2014-09-05 07:53 /user/hadoop/hadoop-config/mapred-site.xml
-rw-r--r-- 3 hadoop supergroup 758 2014-09-05 07:53 /user/hadoop/hadoop-config/mapred-site.xml.template
-rw-r--r-- 3 hadoop supergroup 22 2014-09-05 07:53 /user/hadoop/hadoop-config/slaves
-rw-r--r-- 3 hadoop supergroup 2316 2014-09-05 07:53 /user/hadoop/hadoop-config/ssl-client.xml.example
-rw-r--r-- 3 hadoop supergroup 2268 2014-09-05 07:53 /user/hadoop/hadoop-config/ssl-server.xml.example
-rw-r--r-- 3 hadoop supergroup 2178 2014-09-05 07:54 /user/hadoop/hadoop-config/yarn-env.cmd
-rw-r--r-- 3 hadoop supergroup 4567 2014-09-05 07:54 /user/hadoop/hadoop-config/yarn-env.sh
-rw-r--r-- 3 hadoop supergroup 1007 2014-09-05 07:54 /user/hadoop/hadoop-config/yarn-site.xml
Now, let’s run the grep example:
cd
hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep /user/hadoop/hadoop-config /user/hadoop/output 'dfs[a-z.]+'
It will print status as follows if everything works well.
14/09/05 07:54:36 INFO client.RMProxy: Connecting to ResourceManager at biot/10.0.3.31:8032
14/09/05 07:54:37 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
14/09/05 07:54:37 INFO input.FileInputFormat: Total input paths to process : 26
14/09/05 07:54:37 INFO mapreduce.JobSubmitter: number of splits:26
14/09/05 07:54:37 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1409903409779_0001
14/09/05 07:54:37 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
14/09/05 07:54:38 INFO impl.YarnClientImpl: Submitted application application_1409903409779_0001
14/09/05 07:54:38 INFO mapreduce.Job: The url to track the job: http://biot:8088/proxy/application_1409903409779_0001/
14/09/05 07:54:38 INFO mapreduce.Job: Running job: job_1409903409779_0001
14/09/05 07:54:45 INFO mapreduce.Job: Job job_1409903409779_0001 running in uber mode : false
14/09/05 07:54:45 INFO mapreduce.Job: map 0% reduce 0%
14/09/05 07:54:50 INFO mapreduce.Job: map 23% reduce 0%
14/09/05 07:54:52 INFO mapreduce.Job: map 81% reduce 0%
14/09/05 07:54:53 INFO mapreduce.Job: map 100% reduce 0%
14/09/05 07:54:56 INFO mapreduce.Job: map 100% reduce 100%
14/09/05 07:54:56 INFO mapreduce.Job: Job job_1409903409779_0001 completed successfully
14/09/05 07:54:56 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=319
FILE: Number of bytes written=2622017
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=65815
HDFS: Number of bytes written=405
HDFS: Number of read operations=81
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=26
Launched reduce tasks=1
Data-local map tasks=26
Total time spent by all maps in occupied slots (ms)=116856
Total time spent by all reduces in occupied slots (ms)=3000
Total time spent by all map tasks (ms)=116856
Total time spent by all reduce tasks (ms)=3000
Total vcore-seconds taken by all map tasks=116856
Total vcore-seconds taken by all reduce tasks=3000
Total megabyte-seconds taken by all map tasks=119660544
Total megabyte-seconds taken by all reduce tasks=3072000
Map-Reduce Framework
Map input records=1624
Map output records=23
Map output bytes=566
Map output materialized bytes=469
Input split bytes=3102
Combine input records=23
Combine output records=12
Reduce input groups=10
Reduce shuffle bytes=469
Reduce input records=12
Reduce output records=10
Spilled Records=24
Shuffled Maps =26
Failed Shuffles=0
Merged Map outputs=26
GC time elapsed (ms)=363
CPU time spent (ms)=15310
Physical memory (bytes) snapshot=6807674880
Virtual memory (bytes) snapshot=32081272832
Total committed heap usage (bytes)=5426970624
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=62713
File Output Format Counters
Bytes Written=405
14/09/05 07:54:56 INFO client.RMProxy: Connecting to ResourceManager at biot/10.0.3.31:8032
14/09/05 07:54:56 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
14/09/05 07:54:56 INFO input.FileInputFormat: Total input paths to process : 1
14/09/05 07:54:56 INFO mapreduce.JobSubmitter: number of splits:1
14/09/05 07:54:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1409903409779_0002
14/09/05 07:54:56 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
14/09/05 07:54:56 INFO impl.YarnClientImpl: Submitted application application_1409903409779_0002
14/09/05 07:54:56 INFO mapreduce.Job: The url to track the job: http://biot:8088/proxy/application_1409903409779_0002/
14/09/05 07:54:56 INFO mapreduce.Job: Running job: job_1409903409779_0002
14/09/05 07:55:02 INFO mapreduce.Job: Job job_1409903409779_0002 running in uber mode : false
14/09/05 07:55:02 INFO mapreduce.Job: map 0% reduce 0%
14/09/05 07:55:07 INFO mapreduce.Job: map 100% reduce 0%
14/09/05 07:55:12 INFO mapreduce.Job: map 100% reduce 100%
14/09/05 07:55:13 INFO mapreduce.Job: Job job_1409903409779_0002 completed successfully
14/09/05 07:55:13 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=265
FILE: Number of bytes written=193601
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=527
HDFS: Number of bytes written=179
HDFS: Number of read operations=7
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=2616
Total time spent by all reduces in occupied slots (ms)=2855
Total time spent by all map tasks (ms)=2616
Total time spent by all reduce tasks (ms)=2855
Total vcore-seconds taken by all map tasks=2616
Total vcore-seconds taken by all reduce tasks=2855
Total megabyte-seconds taken by all map tasks=2678784
Total megabyte-seconds taken by all reduce tasks=2923520
Map-Reduce Framework
Map input records=10
Map output records=10
Map output bytes=239
Map output materialized bytes=265
Input split bytes=122
Combine input records=0
Combine output records=0
Reduce input groups=5
Reduce shuffle bytes=265
Reduce input records=10
Reduce output records=10
Spilled Records=20
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=20
CPU time spent (ms)=2090
Physical memory (bytes) snapshot=415334400
Virtual memory (bytes) snapshot=2382364672
Total committed heap usage (bytes)=401997824
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=405
File Output Format Counters
Bytes Written=179
After the grep execution finishes, we can check the results. We can check the content in the output directory by:
hdfs dfs -ls /user/hadoop/output/
It should print output as follows.
Found 2 items
-rw-r--r-- 3 hadoop supergroup 0 2014-09-05 07:55 /user/hadoop/output/_SUCCESS
-rw-r--r-- 3 hadoop supergroup 179 2014-09-05 07:55 /user/hadoop/output/part-r-00000
part-r-00000 contains the result. Let’s cat it out to check the content.
hdfs dfs -cat /user/hadoop/output/part-r-00000
It should print output as follows.
6 dfs.audit.logger
4 dfs.class
3 dfs.server.namenode.
2 dfs.period
2 dfs.audit.log.maxfilesize
2 dfs.audit.log.maxbackupindex
1 dfsmetrics.log
1 dfsadmin
1 dfs.servers
1 dfs.file
Stop HDFS/YARN
Manually
On the ResourceManager node and on each NodeManager node, respectively:
yarn-daemon.sh stop resourcemanager
yarn-daemon.sh stop nodemanager
On the NameNode:
hadoop-daemon.sh --script hdfs stop namenode
On each DataNode:
hadoop-daemon.sh --script hdfs stop datanode
With scripts
stop-yarn.sh
stop-dfs.sh
Debug information
You may check the logs for debugging. The logs on each node are under:
hadoop/logs/
You may want to clean up everything on all nodes, e.g. to remove the data directories on the DataNodes (if you did not choose the directories yourself in hdfs-site.xml). Be careful: the following script deletes everything under /tmp that your current user can delete, so adapt it if you store useful data under /tmp.
rm -rf /tmp/* ~/hadoop/logs/*
for i in `cat hadoop/etc/hadoop/slaves`; do
echo $i; ssh $i 'rm -rf /tmp/* ~/hadoop/logs/';
done
More readings on HDFS/YARN/Hadoop
Here are links to some good articles related to Hadoop on the Web:
Default Hadoop configuration values: https://www.systutorials.com/qa/749/hadoop-2-yarn-default-configuration-values
Official cluster setup tutorial: https://hadoop.apache.org/docs/r2.5.0/hadoop-project-dist/hadoop-common/ClusterSetup.html
Guides to configure Hadoop:
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html
http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
Additional notes
Control number of containers
You may want to configure the number of containers managed by YARN on each node. You can refer to the example below.
The following lines are added to yarn-site.xml to specify that each node offers 3072MB of memory to YARN and each container uses at least 1536MB of memory; that is, at most 2 containers run on each node.
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>3072</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1536</value>
</property>
I believe if you use /user/hadoop instead, you can directly access folders inside it, similar to your home directory in HDFS. E.g., /user/hadoop/input can be referenced as just input while accessing HDFS.
On my cluster:
$ hdfs dfs -ls /user/hadoop/
ls: `/user/hadoop/': No such file or directory
and
$ hdfs dfs -put hadoop-2.5.0.tar.gz hadoop-2.5.0.tar.gz
put: `hadoop-2.5.0.tar.gz': No such file or directory
It seems that the /user/hadoop/ directory is not automatically created.
After it is created, it can be just referenced as you posted. Nice!
BTW: the path is currently hardcoded in HDFS: https://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java?view=markup
@Override
public Path getHomeDirectory() {
  return makeQualified(new Path("/user/" + dfs.ugi.getShortUserName()));
}
Thank you very much for this guide. The Apache docs are not easy to follow.
Dear Eric Zhiqiang,
I am very new to Hadoop, and this blog post explains the multi-node cluster setup in a very nice way. However, I have some queries before starting the multi-node cluster setup.
1. Can multiple virtual machines (hosted on a single ESXi server) be used as nodes of a multi-node Hadoop cluster?
2. As per your suggestion, we first have to do the Hadoop configuration on a specific node (say, a client node) and then duplicate the Hadoop configuration files to all nodes. Can we use the NameNode or any DataNode as the client node, or do we have to use a dedicated node as the client node?
3. Is it necessary to write the NameNode’s hostname in the slaves file if I want to run my task tracker service only on the DataNodes?
4. I am planning to use RHEL 6.4 on all my nodes and Hadoop version hadoop-2.5.1.tar.gz. Can we use the in-box OpenJDK with the version below:
java version “1.7.0_09-icedtea”
OpenJDK Runtime Environment (rhel-2.3.4.1.el6_3-x86_64)
OpenJDK 64-Bit Server VM (build 23.2-b09, mixed mode)
Hi md,
1. Yes.
2. I used the NameNode in this tutorial as the “client node”.
3. No, not if you do not want to run a DataNode on the NameNode host.
4. I suggest using the Oracle JVM. But 1.7.0_09-icedtea seems to be reported as “Good” too: https://wiki.apache.org/hadoop/HadoopJavaVersions .
Thanks Eric Zhiqiang for your quick response.
It seems the DataNodes are not identified by the NameNode.
1. Some problems noted in http://www.highlyscalablesystems.com/3022/pitfalls-and-lessons-on-configuing-and-tuning-hadoop/ may still apply. You may check them.
2. Another common problem for me is that the firewalls on these nodes block the network traffic. If the nodes are in a controlled and trusted cluster, you may disable firewalld (on F20: https://www.systutorials.com/qa/692/how-to-totally-disable-firewall-or-iptables-on-fedora-20 ) or iptables (earlier releases: http://www.fclose.com/3837/flushing-iptables-on-fedora/ ).
3. You may also log on to the nodes running the DataNode and use `ps aux | grep java` to check whether the DataNode daemon is running.
Hope these tips help.
Hi Eric,
I am getting the following error when trying to check the HDFS status on the NameNode or DataNode:
[hadoop@namenode ~]$ hdfs dfsadmin -report
14/11/29 10:45:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Configured Capacity: 0 (0 B)
Present Capacity: 0 (0 B)
DFS Remaining: 0 (0 B)
DFS Used: 0 (0 B)
DFS Used%: NaN%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Can you please suggest a solution?
Thanks in advance.
Hi Eric,
Thank you so much for your help. But the firewall is already off on my machine. I performed the following steps, and the problem got resolved:
1. Stop the cluster
2. Delete the data directory on the problematic DataNode: the directory is specified by dfs.data.dir in conf/hdfs-site.xml
3. Reformat the NameNode (NOTE: all HDFS data is lost during this process!)
4. Restart the cluster
Courtesy:stackoverflow.com/questions/10097246/no-data-nodes-are-started
Hi md,
Noted. Thanks for sharing.
NameNode metadata is critical for the whole HDFS cluster. If you use a single node for the NameNode, make replicas of the metadata on 2 or more separate disks for higher data reliability.
Please check this post for more information on how to replicate and set up 2-disk metadata storage for the NameNode:
https://www.systutorials.com/qa/1315/add-new-hdfs-namenode-metadata-directory-existing-cluster
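As a sketch, dfs.namenode.name.dir accepts a comma-separated list of directories and the NameNode keeps a full copy of its metadata in each of them; for example, in hdfs-site.xml (the /mnt/disk2 path below is hypothetical):
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/hdfs/,file:///mnt/disk2/hdfs-name/</value>
<description>Two NameNode metadata directories on separate disks.</description>
</property>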
I had better luck defining JAVA_HOME in hadoop-env.sh
That’s better if you use a Java version different from the system-wide one.
Thanks for the detailed info. I believe everyone likes the tutorial and the way you took us through each and every step. Kudos!
Hi Eric,
Very useful blog. Just wondering about containers – do you have more details on them. For example:
1. If one node has TWO containers, can one MapReduce job spawn up to two tasks only on that node? Or can each container run more than one task?
2. Do you know the internals of YARN, in particular, which part of the YARN code actually spawns off the different containers/tasks?
Thanks
C.Chee
You may check the “Architecture of Next Generation Apache Hadoop MapReduce Framework”:
https://issues.apache.org/jira/secure/attachment/12486023/MapReduce_NextGen_Architecture.pdf
The “Resource Model” section discusses the model for YARN v1.0.
For the implementation, you may need to dive into the source code tree.
Thanks for the tutorial. I have one question and one issue requiring your help.
Is the value mapreduce_shuffle or mapreduce.shuffle?
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>shuffle service for MapReduce</description>
</property>
I configured Hadoop 2.5.2 following your guideline. HDFS is configured and the DataNodes are reporting. yarn node -list runs and reports the nodes in my cluster. But I am getting “Exception from container-launch: ExitCodeException exitCode=134: /bin/bash: line 1: 29182 Aborted” at 23% of the Map task.
Could you please help me resolve this exception?
15/02/05 12:00:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
15/02/05 12:00:36 INFO client.RMProxy: Connecting to ResourceManager at 101-master/192.168.0.18:8032
15/02/05 12:00:42 INFO input.FileInputFormat: Total input paths to process : 1
15/02/05 12:00:43 INFO mapreduce.JobSubmitter: number of splits:1
15/02/05 12:00:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1423137573492_0001
15/02/05 12:00:44 INFO impl.YarnClientImpl: Submitted application application_1423137573492_0001
15/02/05 12:00:44 INFO mapreduce.Job: The url to track the job: http://101-master:8088/proxy/application_1423137573492_0001/
15/02/05 12:00:44 INFO mapreduce.Job: Running job: job_1423137573492_0001
15/02/05 12:00:58 INFO mapreduce.Job: Job job_1423137573492_0001 running in uber mode : false
15/02/05 12:00:58 INFO mapreduce.Job: map 0% reduce 0%
15/02/05 12:01:17 INFO mapreduce.Job: map 8% reduce 0%
15/02/05 12:01:20 INFO mapreduce.Job: map 12% reduce 0%
15/02/05 12:01:23 INFO mapreduce.Job: map 16% reduce 0%
15/02/05 12:01:26 INFO mapreduce.Job: map 23% reduce 0%
15/02/05 12:01:30 INFO mapreduce.Job: Task Id : attempt_1423137573492_0001_m_000000_0, Status : FAILED
Exception from container-launch: ExitCodeException exitCode=134: /bin/bash: line 1: 29182 Aborted (core dumped) /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx200m -Djava.io.tmpdir=/tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1423137573492_0001/container_1423137573492_0001_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 192.168.0.19 41158 attempt_1423137573492_0001_m_000000_0 2 > /home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002/stdout 2> /home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002/stderr
ExitCodeException exitCode=134: /bin/bash: line 1: 29182 Aborted (core dumped) /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx200m -Djava.io.tmpdir=/tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1423137573492_0001/container_1423137573492_0001_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 192.168.0.19 41158 attempt_1423137573492_0001_m_000000_0 2 > /home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002/stdout 2> /home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002/stderr
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Hi Tariq,
It is “mapreduce_shuffle”.
Check: https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html
I have no idea what the reason for the error is.
The exit status 134 may tell some information (check a discussion here: https://groups.google.com/forum/#!topic/comp.lang.java.machine/OibTSkLJ-bY ). In that post, the JVM is from Oracle. Your JVM seems to be OpenJDK. You may try the Oracle JVM.
Dear Eric,
I am having problems with the YARN/MapReduce configuration. I have asked these questions on Stack Overflow. Could you please have a look at them and answer them, if possible? Thanks.
http://stackoverflow.com/questions/28586561/yarn-container-lauch-failed-exception-and-mapred-site-xml-configuration
http://stackoverflow.com/questions/28609639/yarn-container-configuration-for-javacv
Regards,
Hi Tariq, just noticed your comment.
I did experience the `exitCode=134` problem once.
My solution was to add the following setting to `hadoop/etc/hadoop/yarn-site.xml`:
That is all I did.
You may check how much memory your program uses in one task and set the value to be larger than that.
Thank you for the wonderful tutorial; since I am a beginner, it was really easy. I made a cluster with two slave nodes and one master node. I have a doubt: how do I check whether map/reduce tasks are running on the slave nodes? Are there specific files to check in the logs directory, and if yes, which ones? The yarn node -list command shows 3 nodes with status RUNNING.
Hi Eric,
Please help me, I am unable to run my first basic example; I am getting the below message:
“15/03/20 20:35:13 INFO mapred.JobClient: Cleaning up the staging area hdfs://localhost:54310/usr/local/hadoop/tmp/hadoop-hduser/mapred/staging/hduser/.staging/job_201503201943_0007
15/03/20 20:35:13 ERROR security.UserGroupInformation: PriviledgedActionException as:hduser cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:54310/usr/local/hadoop/input
Exception in thread “main” org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:54310/usr/local/hadoop/input
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:962)”
Thanks
Shekar M
Make sure the input path exists in HDFS (check with `hdfs dfs -ls /usr/local/hadoop/input`), as Hadoop prints “Input path does not exist”.
Thanks for the great tutorial. It’s the most up-to-date information I’ve found. I have a couple of questions.
You don’t mention the masters file, which some other cluster configuration blogs show. Should we add a masters file to go along with the included slaves file? Also, your script will distribute the modified slaves files to the slave nodes. Do the slave nodes need the modified slaves file, or is it ignored on the slave nodes?
When we format the dfs with “hdfs namenode -format” should this be done on all nodes, or just the master?
About the masters file: the masters file is for the Secondary NameNodes ( https://hadoop.apache.org/docs/r2.5.0/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Secondary_NameNode ). In 2.5.0, it is better to use the “dfs.namenode.secondary.http-address” property ( https://hadoop.apache.org/docs/r2.5.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml ). I am not sure whether the masters file still works for specifying the Secondary NameNodes. You may give it a try, and you are welcome to share your findings here.
FYI: the start-dfs.sh script also starts the secondary namenodes.
The `hdfs getconf` command gets the addresses of the secondary namenodes. A quick try shows me that the masters file has no effect and the address from `dfs.namenode.secondary.http-address` is used.
The slave nodes do not need the slaves file. You can skip it.
For `hdfs namenode -format`, it only needs to be done on the master (the NameNode).
Eric,
I really appreciate your efforts in publishing this article and answering queries from Hadoop users. Before coming across your post/article, I struggled for two weeks to run MapReduce code successfully in a multi-node environment. The MapReduce job used to hang indefinitely. I was missing the “yarn.resourcemanager.hostname” parameter in the “yarn-site.xml” config file. Your article helped me find this missing piece, and I could then run all my MapReduce jobs successfully.
Thanks a lot.
Great to hear that! :)
Now that I’ve got a working Hadoop cluster I’d like to install HBase and Zookeeper too. Do you know any good tutorials for installing HBase and Zookeeper?
You may try the official one first: https://hbase.apache.org/apache_hbase_reference_guide.pdf . I did not try it out myself. But it looks pretty well written.
Hi, can we add a NameNode to a running cluster?
If yes, what would be the steps?
It is not covered in this tutorial. To make the NameNode highly available with more than one NameNode node and avoid the single point of failure, you may consider 2 choices:
HDFS High Availability using a shared NFS directory to share edit logs between the Active and Standby NameNodes: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html
HDFS High Availability Using the Quorum Journal Manager: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
Hi,
What would be a standard OS (in Linux) on which I can perform the above steps to install it?
Thanks!
Dev
The tutorial does not rely on any specific Linux distro. CentOS 7, Fedora 12+, Ubuntu 12+, and many other distros should be good enough as long as the needed tools are installed.