Hadoop Installation Tutorial (Hadoop 2.x)
Hadoop 2 (YARN) is the new version of Hadoop. It adds the YARN resource manager in addition to the HDFS and MapReduce components. Hadoop MapReduce is a programming model and software framework for writing applications; it is an open-source variant of MapReduce, which was designed and implemented by Google initially for processing and generating large data sets. HDFS is Hadoop’s underlying data persistence layer, loosely modeled after the Google File System (GFS). Many cloud computing services, such as Amazon EC2, provide MapReduce functions. Although MapReduce has its limitations, it is an important framework for processing large data sets.
This tutorial shows how to set up a Hadoop 2.x (YARN) environment in a cluster. We set up a Hadoop (YARN) cluster in which one node runs the NameNode and the ResourceManager, and the other nodes run the NodeManager and DataNode (slaves). If you are not familiar with these names, please take a look at the YARN architecture first.
First, we assume that you have created a Linux user “hadoop” on each node that you will use for running Hadoop, and that the “hadoop” user’s home directory is /home/hadoop/.
Configure hostnames
Hadoop uses hostnames to identify nodes by default. So you should first give each host a name and allow the other nodes to resolve the hosts’ names to IP addresses. The simplest way is to add the hostname-to-IP mappings to every node’s /etc/hosts file. For a larger cluster, you may use a DNS service. Here, we use 3 nodes as the example and add these lines to the /etc/hosts file on each node:
10.0.3.29 hofstadter
10.0.3.30 snell
10.0.3.31 biot
Enable “hadoop” user to password-less SSH login to slaves
For convenience, make sure the “hadoop” user on the NameNode and ResourceManager node can ssh to the slaves without a password, so that we do not need to enter the password every time.
Details about password-less SSH login can be found in Enabling Password-less ssh Login.
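As a quick sketch (assuming OpenSSH and the hostnames used above), you can generate a key pair for the “hadoop” user on the NameNode/ResourceManager node and copy the public key to every slave:
ssh-keygen -t rsa              # accept the defaults and leave the passphrase empty
ssh-copy-id hadoop@hofstadter
ssh-copy-id hadoop@snell
ssh-copy-id hadoop@biot
If ssh-copy-id is not available, append ~/.ssh/id_rsa.pub to ~/.ssh/authorized_keys on each slave manually.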
Install software needed by Hadoop
Besides Hadoop itself, the software needed to run Hadoop is Java (we use the JDK here).
Java JDK
The Oracle Java JDK can be downloaded from the JDK’s webpage. You need to install (actually, just copy the JDK directory) the Java JDK on all nodes of the Hadoop cluster.
As an example in this tutorial, the JDK is installed into
/usr/java/default/
You may need to make a soft link to /usr/java/default from the actual location where you installed the JDK.
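For example, if the JDK was unpacked to /usr/java/jdk1.7.0_67 (a hypothetical path; substitute the actual directory on your nodes), the link can be created as follows:
ln -s /usr/java/jdk1.7.0_67 /usr/java/default    # run as root; the JDK path here is only an example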
Add these 2 lines to the “hadoop” user’s ~/.bashrc on all nodes:
export JAVA_HOME=/usr/java/default
export PATH=$JAVA_HOME/bin:$PATH
Hadoop
Hadoop software can be downloaded from the Hadoop website. In this tutorial, we use Hadoop 2.5.0.
You can unpack the tar ball to a directory. In this example, we unpack it to
/home/hadoop/hadoop/
which is a directory under the hadoop Linux user’s home directory.
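For example, assuming the downloaded tarball is hadoop-2.5.0.tar.gz in the “hadoop” user’s home directory (the file name depends on the release you downloaded):
cd ~
tar xzf hadoop-2.5.0.tar.gz    # unpacks into ./hadoop-2.5.0
mv hadoop-2.5.0 hadoop         # so that Hadoop lives in /home/hadoop/hadoop/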
The Hadoop directory needs to be duplicated to all nodes after configuration; remember to do this after finishing the configuration below.
Configure environment variables for the “hadoop” user
We assume the “hadoop” user uses bash as its shell.
Add these lines at the bottom of ~/.bashrc on all nodes:
export HADOOP_COMMON_HOME=$HOME/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_COMMON_HOME
export HADOOP_HDFS_HOME=$HADOOP_COMMON_HOME
export YARN_HOME=$HADOOP_COMMON_HOME
export PATH=$PATH:$HADOOP_COMMON_HOME/bin
export PATH=$PATH:$HADOOP_COMMON_HOME/sbin
The last 2 lines add Hadoop’s bin and sbin directories to the PATH so that we can run Hadoop’s commands directly without specifying their full paths.
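To make these variables take effect in the current shell without logging out and back in, reload the file:
source ~/.bashrc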
Configure Hadoop
The configuration files for Hadoop are under $HADOOP_COMMON_HOME/etc/hadoop for our installation. The content shown below is added to each .xml file between <configuration> and </configuration>.
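For reference, each of these .xml files is a small XML document; the shipped files already contain an empty <configuration> element, and a minimal skeleton looks roughly like this (the <property> elements from the following sections go inside it):
<?xml version="1.0"?>
<configuration>
<!-- <property> elements go here -->
</configuration>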
core-site.xml
Here the NameNode runs on biot.
<property>
<name>fs.defaultFS</name>
<value>hdfs://biot/</value>
<description>NameNode URI</description>
</property>
yarn-site.xml
The YARN ResourceManager runs on biot and supports MapReduce shuffle.
<property>
<name>yarn.resourcemanager.hostname</name>
<value>biot</value>
<description>The hostname of the ResourceManager</description>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>shuffle service for MapReduce</description>
</property>
hdfs-site.xml
The configuration here is optional. Add the following settings if you need them. The descriptions contain the purpose of each configuration.
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hadoop/hdfs/</value>
<description>DataNode directory for storing data chunks.</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/hdfs/</value>
<description>NameNode directory for namespace and transaction logs storage.</description>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Number of replicas for each block.</description>
</property>
mapred-site.xml
First copy mapred-site.xml.template to mapred-site.xml.
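A sketch of the copy step, assuming you run it from the configuration directory:
cd $HADOOP_COMMON_HOME/etc/hadoop
cp mapred-site.xml.template mapred-site.xml
Then add the following content between <configuration> and </configuration>: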
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>Execution framework.</description>
</property>
slaves
Delete localhost and add the hostnames of all the slave nodes (the nodes running the DataNode and NodeManager), one per line. For example:
hofstadter
snell
biot
Duplicate Hadoop configuration files to all nodes
We now duplicate the Hadoop directory, including the configuration files under etc/hadoop, to all nodes. You may use this script to duplicate the hadoop directory:
cd
for i in `cat hadoop/etc/hadoop/slaves`; do
echo $i; rsync -avxP --exclude=logs hadoop/ $i:hadoop/;
done
By now, we have finished copying the Hadoop software and configuring Hadoop. Now let’s have some fun with Hadoop.
Format a new distributed filesystem
Format a new distributed file system by running the following command on the NameNode:
hdfs namenode -format
Start HDFS/YARN
Manually
Start the HDFS with the following command, run on the designated NameNode:
hadoop-daemon.sh --script hdfs start namenode
Run the script to start DataNodes on each slave:
hadoop-daemon.sh --script hdfs start datanode
Start the YARN with the following command, run on the designated ResourceManager:
yarn-daemon.sh start resourcemanager
Run a script to start NodeManagers on all slaves:
yarn-daemon.sh start nodemanager
By scripts
start-dfs.sh
start-yarn.sh
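To quickly verify that the daemons came up on a node, you can list the running Java processes with jps (shipped with the JDK); the NameNode/ResourceManager host should show NameNode and ResourceManager, and each slave should show DataNode and NodeManager:
jps    # lists running JVM processes such as NameNode, DataNode, ResourceManager, NodeManager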
Check status
Check HDFS status
hdfs dfsadmin -report
It should show a report like:
Configured Capacity: 158550355968 (147.66 GB)
Present Capacity: 11206017024 (10.44 GB)
DFS Remaining: 11205943296 (10.44 GB)
DFS Used: 73728 (72 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Live datanodes (3):
Name: 10.0.3.30:50010 (snell)
Hostname: snell
Decommission Status : Normal
Configured Capacity: 52850118656 (49.22 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 49800732672 (46.38 GB)
DFS Remaining: 3049361408 (2.84 GB)
DFS Used%: 0.00%
DFS Remaining%: 5.77%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Fri Sep 05 07:50:17 GMT 2014
Name: 10.0.3.31:50010 (biot)
Hostname: biot
Decommission Status : Normal
Configured Capacity: 52850118656 (49.22 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 49513574400 (46.11 GB)
DFS Remaining: 3336519680 (3.11 GB)
DFS Used%: 0.00%
DFS Remaining%: 6.31%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Fri Sep 05 07:50:17 GMT 2014
Name: 10.0.3.29:50010 (hofstadter)
Hostname: hofstadter
Decommission Status : Normal
Configured Capacity: 52850118656 (49.22 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 48030068736 (44.73 GB)
DFS Remaining: 4820025344 (4.49 GB)
DFS Used%: 0.00%
DFS Remaining%: 9.12%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Fri Sep 05 07:50:17 GMT 2014
Check YARN status
yarn node -list
It should show a report like:
Total Nodes:3
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
hofstadter:43469 RUNNING hofstadter:8042 0
snell:57039 RUNNING snell:8042 0
biot:52834 RUNNING biot:8042 0
Run basic tests
Let’s play with grep as the basic test of Hadoop YARN/HDFS/MapReduce.
First, create the directory to store data for the hadoop user.
hadoop fs -mkdir /user
hadoop fs -mkdir /user/hadoop
Then, put the configuration file directory as the input.
hadoop fs -put /home/hadoop/hadoop/etc/hadoop /user/hadoop/hadoop-config
or simply refer to the directory relative to the hadoop user’s HDFS home directory (check the discussion in the comments below; thanks to Thirumal Venkat for this tip):
hadoop fs -put /home/hadoop/hadoop/etc/hadoop hadoop-config
Let’s ls it to check the content:
hdfs dfs -ls /user/hadoop/hadoop-config
It should print out results like the following.
Found 26 items
-rw-r--r-- 3 hadoop supergroup 3589 2014-09-05 07:53 /user/hadoop/hadoop-config/capacity-scheduler.xml
-rw-r--r-- 3 hadoop supergroup 1335 2014-09-05 07:53 /user/hadoop/hadoop-config/configuration.xsl
-rw-r--r-- 3 hadoop supergroup 318 2014-09-05 07:53 /user/hadoop/hadoop-config/container-executor.cfg
-rw-r--r-- 3 hadoop supergroup 917 2014-09-05 07:53 /user/hadoop/hadoop-config/core-site.xml
-rw-r--r-- 3 hadoop supergroup 3589 2014-09-05 07:53 /user/hadoop/hadoop-config/hadoop-env.cmd
-rw-r--r-- 3 hadoop supergroup 3443 2014-09-05 07:53 /user/hadoop/hadoop-config/hadoop-env.sh
-rw-r--r-- 3 hadoop supergroup 2490 2014-09-05 07:53 /user/hadoop/hadoop-config/hadoop-metrics.properties
-rw-r--r-- 3 hadoop supergroup 1774 2014-09-05 07:53 /user/hadoop/hadoop-config/hadoop-metrics2.properties
-rw-r--r-- 3 hadoop supergroup 9201 2014-09-05 07:53 /user/hadoop/hadoop-config/hadoop-policy.xml
-rw-r--r-- 3 hadoop supergroup 775 2014-09-05 07:53 /user/hadoop/hadoop-config/hdfs-site.xml
-rw-r--r-- 3 hadoop supergroup 1449 2014-09-05 07:53 /user/hadoop/hadoop-config/httpfs-env.sh
-rw-r--r-- 3 hadoop supergroup 1657 2014-09-05 07:53 /user/hadoop/hadoop-config/httpfs-log4j.properties
-rw-r--r-- 3 hadoop supergroup 21 2014-09-05 07:53 /user/hadoop/hadoop-config/httpfs-signature.secret
-rw-r--r-- 3 hadoop supergroup 620 2014-09-05 07:53 /user/hadoop/hadoop-config/httpfs-site.xml
-rw-r--r-- 3 hadoop supergroup 11118 2014-09-05 07:53 /user/hadoop/hadoop-config/log4j.properties
-rw-r--r-- 3 hadoop supergroup 918 2014-09-05 07:53 /user/hadoop/hadoop-config/mapred-env.cmd
-rw-r--r-- 3 hadoop supergroup 1383 2014-09-05 07:53 /user/hadoop/hadoop-config/mapred-env.sh
-rw-r--r-- 3 hadoop supergroup 4113 2014-09-05 07:53 /user/hadoop/hadoop-config/mapred-queues.xml.template
-rw-r--r-- 3 hadoop supergroup 887 2014-09-05 07:53 /user/hadoop/hadoop-config/mapred-site.xml
-rw-r--r-- 3 hadoop supergroup 758 2014-09-05 07:53 /user/hadoop/hadoop-config/mapred-site.xml.template
-rw-r--r-- 3 hadoop supergroup 22 2014-09-05 07:53 /user/hadoop/hadoop-config/slaves
-rw-r--r-- 3 hadoop supergroup 2316 2014-09-05 07:53 /user/hadoop/hadoop-config/ssl-client.xml.example
-rw-r--r-- 3 hadoop supergroup 2268 2014-09-05 07:53 /user/hadoop/hadoop-config/ssl-server.xml.example
-rw-r--r-- 3 hadoop supergroup 2178 2014-09-05 07:54 /user/hadoop/hadoop-config/yarn-env.cmd
-rw-r--r-- 3 hadoop supergroup 4567 2014-09-05 07:54 /user/hadoop/hadoop-config/yarn-env.sh
-rw-r--r-- 3 hadoop supergroup 1007 2014-09-05 07:54 /user/hadoop/hadoop-config/yarn-site.xml
Now, let’s run the grep example:
cd
hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep /user/hadoop/hadoop-config /user/hadoop/output 'dfs[a-z.]+'
It will print status as follows if everything works well.
14/09/05 07:54:36 INFO client.RMProxy: Connecting to ResourceManager at biot/10.0.3.31:8032
14/09/05 07:54:37 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
14/09/05 07:54:37 INFO input.FileInputFormat: Total input paths to process : 26
14/09/05 07:54:37 INFO mapreduce.JobSubmitter: number of splits:26
14/09/05 07:54:37 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1409903409779_0001
14/09/05 07:54:37 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
14/09/05 07:54:38 INFO impl.YarnClientImpl: Submitted application application_1409903409779_0001
14/09/05 07:54:38 INFO mapreduce.Job: The url to track the job: http://biot:8088/proxy/application_1409903409779_0001/
14/09/05 07:54:38 INFO mapreduce.Job: Running job: job_1409903409779_0001
14/09/05 07:54:45 INFO mapreduce.Job: Job job_1409903409779_0001 running in uber mode : false
14/09/05 07:54:45 INFO mapreduce.Job: map 0% reduce 0%
14/09/05 07:54:50 INFO mapreduce.Job: map 23% reduce 0%
14/09/05 07:54:52 INFO mapreduce.Job: map 81% reduce 0%
14/09/05 07:54:53 INFO mapreduce.Job: map 100% reduce 0%
14/09/05 07:54:56 INFO mapreduce.Job: map 100% reduce 100%
14/09/05 07:54:56 INFO mapreduce.Job: Job job_1409903409779_0001 completed successfully
14/09/05 07:54:56 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=319
FILE: Number of bytes written=2622017
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=65815
HDFS: Number of bytes written=405
HDFS: Number of read operations=81
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=26
Launched reduce tasks=1
Data-local map tasks=26
Total time spent by all maps in occupied slots (ms)=116856
Total time spent by all reduces in occupied slots (ms)=3000
Total time spent by all map tasks (ms)=116856
Total time spent by all reduce tasks (ms)=3000
Total vcore-seconds taken by all map tasks=116856
Total vcore-seconds taken by all reduce tasks=3000
Total megabyte-seconds taken by all map tasks=119660544
Total megabyte-seconds taken by all reduce tasks=3072000
Map-Reduce Framework
Map input records=1624
Map output records=23
Map output bytes=566
Map output materialized bytes=469
Input split bytes=3102
Combine input records=23
Combine output records=12
Reduce input groups=10
Reduce shuffle bytes=469
Reduce input records=12
Reduce output records=10
Spilled Records=24
Shuffled Maps =26
Failed Shuffles=0
Merged Map outputs=26
GC time elapsed (ms)=363
CPU time spent (ms)=15310
Physical memory (bytes) snapshot=6807674880
Virtual memory (bytes) snapshot=32081272832
Total committed heap usage (bytes)=5426970624
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=62713
File Output Format Counters
Bytes Written=405
14/09/05 07:54:56 INFO client.RMProxy: Connecting to ResourceManager at biot/10.0.3.31:8032
14/09/05 07:54:56 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
14/09/05 07:54:56 INFO input.FileInputFormat: Total input paths to process : 1
14/09/05 07:54:56 INFO mapreduce.JobSubmitter: number of splits:1
14/09/05 07:54:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1409903409779_0002
14/09/05 07:54:56 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
14/09/05 07:54:56 INFO impl.YarnClientImpl: Submitted application application_1409903409779_0002
14/09/05 07:54:56 INFO mapreduce.Job: The url to track the job: http://biot:8088/proxy/application_1409903409779_0002/
14/09/05 07:54:56 INFO mapreduce.Job: Running job: job_1409903409779_0002
14/09/05 07:55:02 INFO mapreduce.Job: Job job_1409903409779_0002 running in uber mode : false
14/09/05 07:55:02 INFO mapreduce.Job: map 0% reduce 0%
14/09/05 07:55:07 INFO mapreduce.Job: map 100% reduce 0%
14/09/05 07:55:12 INFO mapreduce.Job: map 100% reduce 100%
14/09/05 07:55:13 INFO mapreduce.Job: Job job_1409903409779_0002 completed successfully
14/09/05 07:55:13 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=265
FILE: Number of bytes written=193601
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=527
HDFS: Number of bytes written=179
HDFS: Number of read operations=7
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=2616
Total time spent by all reduces in occupied slots (ms)=2855
Total time spent by all map tasks (ms)=2616
Total time spent by all reduce tasks (ms)=2855
Total vcore-seconds taken by all map tasks=2616
Total vcore-seconds taken by all reduce tasks=2855
Total megabyte-seconds taken by all map tasks=2678784
Total megabyte-seconds taken by all reduce tasks=2923520
Map-Reduce Framework
Map input records=10
Map output records=10
Map output bytes=239
Map output materialized bytes=265
Input split bytes=122
Combine input records=0
Combine output records=0
Reduce input groups=5
Reduce shuffle bytes=265
Reduce input records=10
Reduce output records=10
Spilled Records=20
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=20
CPU time spent (ms)=2090
Physical memory (bytes) snapshot=415334400
Virtual memory (bytes) snapshot=2382364672
Total committed heap usage (bytes)=401997824
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=405
File Output Format Counters
Bytes Written=179
After the grep execution finishes, we can check the results. We can check the content in the output directory by:
hdfs dfs -ls /user/hadoop/output/
It should print output as follows.
Found 2 items
-rw-r--r-- 3 hadoop supergroup 0 2014-09-05 07:55 /user/hadoop/output/_SUCCESS
-rw-r--r-- 3 hadoop supergroup 179 2014-09-05 07:55 /user/hadoop/output/part-r-00000
part-r-00000 contains the result. Let’s cat it out to check the content.
hdfs dfs -cat /user/hadoop/output/part-r-00000
It should print output as follows.
6 dfs.audit.logger
4 dfs.class
3 dfs.server.namenode.
2 dfs.period
2 dfs.audit.log.maxfilesize
2 dfs.audit.log.maxbackupindex
1 dfsmetrics.log
1 dfsadmin
1 dfs.servers
1 dfs.file
Stop HDFS/YARN
Manually
On the ResourceManager node and on each NodeManager node, respectively:
yarn-daemon.sh stop resourcemanager
yarn-daemon.sh stop nodemanager
On the NameNode:
hadoop-daemon.sh --script hdfs stop namenode
On each DataNode:
hadoop-daemon.sh --script hdfs stop datanode
With scripts
stop-yarn.sh
stop-dfs.sh
Debug information
You may check the logs for debugging. The logs on each node are under:
hadoop/logs/
You may want to clean up everything on all nodes, e.g. to remove the data directories on the DataNodes (if you did not choose the directories yourself in hdfs-site.xml). Be careful: the following script deletes everything under /tmp that your current user can delete, so adapt it if you store useful data under /tmp.
rm -rf /tmp/* ~/hadoop/logs/*
for i in `cat hadoop/etc/hadoop/slaves`; do
echo $i; ssh $i 'rm -rf /tmp/* ~/hadoop/logs/';
done
More readings on HDFS/YARN/Hadoop
Here are links to some good articles related to Hadoop on the Web:
Default Hadoop configuration values: https://www.systutorials.com/qa/749/hadoop-2-yarn-default-configuration-values
Official cluster setup tutorial: https://hadoop.apache.org/docs/r2.5.0/hadoop-project-dist/hadoop-common/ClusterSetup.html
Guides to configure Hadoop:
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html
http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
Additional notes
Control number of containers
You may want to configure the number of containers managed by YARN on each node. You can refer to the example below.
The following lines are added to yarn-site.xml to specify that each node offers 3072MB of memory to YARN and each container uses at least 1536MB of memory; that is, at most 2 containers run on each node.
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>3072</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1536</value>
</property>
I believe if you use /user/hadoop instead, you can directly access folders inside it, similar to your home directory in HDFS. E.g., /user/hadoop/input can be referenced as just input while accessing HDFS.
On my cluster:
$ hdfs dfs -ls /user/hadoop/
ls: `/user/hadoop/': No such file or directory
and
$ hdfs dfs -put hadoop-2.5.0.tar.gz hadoop-2.5.0.tar.gz
put: `hadoop-2.5.0.tar.gz': No such file or directory
It seems that the /user/hadoop/ directory is not automatically created.
After it is created, it can be just referenced as you posted. Nice!
BTW: the path is currently hardcoded in HDFS: https://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java?view=markup
@Override
public Path getHomeDirectory() {
  return makeQualified(new Path("/user/" + dfs.ugi.getShortUserName()));
}
Thank you very much for this guide. The Apache docs are not easy to follow.
Dear Eric Zhiqiang,
I am very new to Hadoop, and this blog post explains the multi-node cluster setup in a very nice way. However, I have some queries before starting the multi-node cluster setup.
1. Can multiple virtual machines (hosted on a single ESXi server) be used as nodes of a multi-node Hadoop cluster?
2. As per your suggestion, we first have to do the Hadoop configuration on a specific node (say, a client node) and then duplicate the Hadoop configuration files to all nodes. Can we use the NameNode or any DataNode as the client node, or do we have to use a dedicated node as the client node?
3. Is it necessary to write the NameNode’s hostname in the slaves file if I want to run my task tracker service only on the DataNodes?
4. I am planning to use RHEL 6.4 on all my nodes and Hadoop version hadoop-2.5.1.tar.gz. Can we use the in-box OpenJDK with the version below:
java version “1.7.0_09-icedtea”
OpenJDK Runtime Environment (rhel-2.3.4.1.el6_3-x86_64)
OpenJDK 64-Bit Server VM (build 23.2-b09, mixed mode)
Hi md,
1. Yes.
2. I used the NameNode in this tutorial as the “client node”.
3. No, not if you do not want to run a DataNode on the NameNode host.
4. I suggest using the Oracle JVM. But 1.7.0_09-icedtea seems to be reported as “Good” too: https://wiki.apache.org/hadoop/HadoopJavaVersions .
Thanks Eric Zhiqiang for your quick response.
It seems the DataNodes are not identified by the NameNode.
1. Some problems noted in http://www.highlyscalablesystems.com/3022/pitfalls-and-lessons-on-configuing-and-tuning-hadoop/ may still apply. You may check them.
2. Another common problem for me is that the firewalls on these nodes block the network traffic. If the nodes are in a controlled and trusted cluster, you may disable firewalld (on F20: https://www.systutorials.com/qa/692/how-to-totally-disable-firewall-or-iptables-on-fedora-20 ) or iptables (earlier releases: http://www.fclose.com/3837/flushing-iptables-on-fedora/ ).
3. You may also log on to the nodes running the DataNode and use `ps aux | grep java` to check whether the DataNode daemon is running.
Hope these tips help.
Hi Eric,
I am getting the following error when trying to check the HDFS status on the NameNode or DataNode:
[hadoop@namenode ~]$ hdfs dfsadmin -report
14/11/29 10:45:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Configured Capacity: 0 (0 B)
Present Capacity: 0 (0 B)
DFS Remaining: 0 (0 B)
DFS Used: 0 (0 B)
DFS Used%: NaN%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Can you please suggest a solution?
Thanks in advance.
Hi Eric,
Thank you so much for your help. But the firewall is already off on my machine. I performed the following steps, and the problem got resolved:
1. Stop the cluster
2. Delete the data directory on the problematic DataNode: the directory is specified by dfs.data.dir in conf/hdfs-site.xml
3. Reformat the NameNode (NOTE: all HDFS data is lost during this process!)
4. Restart the cluster
Courtesy:stackoverflow.com/questions/10097246/no-data-nodes-are-started
Hi md,
Noted. Thanks for sharing.
NameNode metadata is critical for the whole HDFS cluster. If you use a single node for the NameNode, make replicas of the metadata on 2 or more separate disks for higher data reliability.
Please check this post for more information on how to replicate and set up 2-disk metadata storage for the NameNode:
https://www.systutorials.com/qa/1315/add-new-hdfs-namenode-metadata-directory-existing-cluster
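As a sketch, dfs.namenode.name.dir accepts a comma-separated list of directories and the NameNode keeps a full copy of its metadata in each of them; for example, in hdfs-site.xml (the /mnt/disk2 path below is hypothetical):
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/hdfs/,file:///mnt/disk2/hdfs-name/</value>
<description>Two NameNode metadata directories on separate disks.</description>
</property>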
I had better luck defining JAVA_HOME in hadoop-env.sh
That’s better if you use a Java version different from the system-wide one.
Thanks for the detailed info. I believe everyone likes the tutorial and the way you took us through each and every step. Kudos!
Hi Eric,
Very useful blog. Just wondering about containers – do you have more details on them. For example:
1. If one node has TWO containers, can one MapReduce job spawn up to two tasks only on that node? Or can each container run more than one task?
2. Do you know the internals of YARN, in particular, which part of the YARN code actually spawns off the different containers/tasks?
Thanks
C.Chee
You may check the “Architecture of Next Generation Apache Hadoop MapReduce Framework”:
https://issues.apache.org/jira/secure/attachment/12486023/MapReduce_NextGen_Architecture.pdf
The “Resource Model” section discusses the model for YARN v1.0.
For the implementation, you may need to dive into the source code tree.
Thanks for the tutorial. I have one question and one issue requiring your help.
Is the value mapreduce_shuffle or mapreduce.shuffle?
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>shuffle service for MapReduce</description>
</property>
I configured Hadoop 2.5.2 following your guideline. HDFS is configured and the DataNodes are reporting. yarn node -list runs and reports the nodes in my cluster. But I am getting “Exception from container-launch: ExitCodeException exitCode=134: /bin/bash: line 1: 29182 Aborted” at 23% of the Map task.
Could you please help me resolve this exception?
15/02/05 12:00:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
15/02/05 12:00:36 INFO client.RMProxy: Connecting to ResourceManager at 101-master/192.168.0.18:8032
15/02/05 12:00:42 INFO input.FileInputFormat: Total input paths to process : 1
15/02/05 12:00:43 INFO mapreduce.JobSubmitter: number of splits:1
15/02/05 12:00:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1423137573492_0001
15/02/05 12:00:44 INFO impl.YarnClientImpl: Submitted application application_1423137573492_0001
15/02/05 12:00:44 INFO mapreduce.Job: The url to track the job: http://101-master:8088/proxy/application_1423137573492_0001/
15/02/05 12:00:44 INFO mapreduce.Job: Running job: job_1423137573492_0001
15/02/05 12:00:58 INFO mapreduce.Job: Job job_1423137573492_0001 running in uber mode : false
15/02/05 12:00:58 INFO mapreduce.Job: map 0% reduce 0%
15/02/05 12:01:17 INFO mapreduce.Job: map 8% reduce 0%
15/02/05 12:01:20 INFO mapreduce.Job: map 12% reduce 0%
15/02/05 12:01:23 INFO mapreduce.Job: map 16% reduce 0%
15/02/05 12:01:26 INFO mapreduce.Job: map 23% reduce 0%
15/02/05 12:01:30 INFO mapreduce.Job: Task Id : attempt_1423137573492_0001_m_000000_0, Status : FAILED
Exception from container-launch: ExitCodeException exitCode=134: /bin/bash: line 1: 29182 Aborted (core dumped) /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx200m -Djava.io.tmpdir=/tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1423137573492_0001/container_1423137573492_0001_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 192.168.0.19 41158 attempt_1423137573492_0001_m_000000_0 2 > /home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002/stdout 2> /home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002/stderr
ExitCodeException exitCode=134: /bin/bash: line 1: 29182 Aborted (core dumped) /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx200m -Djava.io.tmpdir=/tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1423137573492_0001/container_1423137573492_0001_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 192.168.0.19 41158 attempt_1423137573492_0001_m_000000_0 2 > /home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002/stdout 2> /home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002/stderr
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Hi Tariq,
It is “mapreduce_shuffle”.
Check: https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html
I have no idea what the reason for the error is.
The exit status 134 may tell some information (check a discussion here: https://groups.google.com/forum/#!topic/comp.lang.java.machine/OibTSkLJ-bY ). In that post, the JVM is from Oracle. Your JVM seems to be OpenJDK. You may try the Oracle JVM.
Dear Eric,
I am having problems with the YARN/MapReduce configuration. I have asked these questions on Stack Overflow. Could you please have a look at them and answer them, if possible? Thanks.
http://stackoverflow.com/questions/28586561/yarn-container-lauch-failed-exception-and-mapred-site-xml-configuration
http://stackoverflow.com/questions/28609639/yarn-container-configuration-for-javacv
Regards,
Hi Tariq, just noticed your comment.
I did experience the `exitCode=134` problem once.
My solution was to add the following setting to `hadoop/etc/hadoop/yarn-site.xml`:
That is all I did.
You may check how much memory your program uses in one task and set the value to be larger than that.
Thank you for the wonderful tutorial; since I am a beginner, it was really easy. I made a cluster with two slave nodes and one master node. I have a doubt: how do I check whether map/reduce tasks are running on the slave nodes? Are there specific files to check in the logs directory, and if yes, which ones? The yarn node -list command shows 3 nodes with status RUNNING.
Hi Eric,
Please help me, I am unable to run my first basic example; I am getting the below message:
“15/03/20 20:35:13 INFO mapred.JobClient: Cleaning up the staging area hdfs://localhost:54310/usr/local/hadoop/tmp/hadoop-hduser/mapred/staging/hduser/.staging/job_201503201943_0007
15/03/20 20:35:13 ERROR security.UserGroupInformation: PriviledgedActionException as:hduser cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:54310/usr/local/hadoop/input
Exception in thread “main” org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:54310/usr/local/hadoop/input
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:962)”
Thanks
Shekar M
Make sure the input path exists in HDFS (check with `hdfs dfs -ls /usr/local/hadoop/input`), as Hadoop prints “Input path does not exist”.
Thanks for the great tutorial. It’s the most up-to-date information I’ve found. I have a couple of questions.
You don’t mention the masters file, which some other cluster configuration blogs show. Should we add a masters file to go along with the included slaves file? Also, your script will distribute the modified slaves files to the slave nodes. Do the slave nodes need the modified slaves file, or is it ignored on the slave nodes?
When we format the dfs with “hdfs namenode -format” should this be done on all nodes, or just the master?
About the masters file: the masters file is for the Secondary NameNodes ( https://hadoop.apache.org/docs/r2.5.0/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Secondary_NameNode ). In 2.5.0, it is better to use the “dfs.namenode.secondary.http-address” property ( https://hadoop.apache.org/docs/r2.5.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml ). I am not sure whether the masters file still works for specifying the Secondary NameNodes. You may give it a try, and you are welcome to share your findings here.
FYI: the start-dfs.sh script also starts the secondary namenodes.
The `hdfs getconf` command gets the addresses of the secondary namenodes. A quick try shows me that the masters file has no effect and the address from `dfs.namenode.secondary.http-address` is used.
The slave nodes do not need the slaves file. You can skip it.
For `hdfs namenode -format`, it only needs to be done on the master (the NameNode).
Eric,
I really appreciate your efforts in publishing this article and answering queries from Hadoop users. Before coming across your post/article, I struggled for two weeks to run MapReduce code successfully in a multi-node environment. The MapReduce job used to hang indefinitely. I was missing the “yarn.resourcemanager.hostname” parameter in the “yarn-site.xml” config file. Your article helped me find this missing piece, and I could then run all my MapReduce jobs successfully.
Thanks a lot.
Great to hear that! :)
Now that I’ve got a working Hadoop cluster I’d like to install HBase and Zookeeper too. Do you know any good tutorials for installing HBase and Zookeeper?
You may try the official one first: https://hbase.apache.org/apache_hbase_reference_guide.pdf . I did not try it out myself. But it looks pretty well written.
Hi, can we add a NameNode to a running cluster?
If yes, what would be the steps?
It is not covered in this tutorial. To make the NameNode highly available with more than one NameNode node and avoid the single point of failure, you may consider 2 choices:
HDFS High Availability using a shared NFS directory to share edit logs between the Active and Standby NameNodes: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html
HDFS High Availability Using the Quorum Journal Manager: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
Hi,
What would be a standard OS (in Linux) on which I can perform the above steps to install it?
Thanks!
Dev
The tutorial does not rely on any specific Linux distro. CentOS 7, Fedora 12+, Ubuntu 12+, and many other distros should be good enough as long as the needed tools are installed.