Tag: yarn
How to find out all files with replication factor 1 in HDFS?
The hdfs dfsadmin -report output shows there are blocks with replication factor 1: "Missing blocks (with replication factor 1): 7". How to find them? You can run hdfs fsck to list all files with their replication counts and grep those with replication factor 1.
Read more
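A minimal sketch of the fsck approach described above (the path / and the exact grep pattern are assumptions; on Hadoop 2, fsck's per-block output contains a repl=N field):

```sh
# List every file and its blocks, then keep lines whose replication count is 1.
hdfs fsck / -files -blocks | grep 'repl=1'
```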
How to change the scheduling policy in Hadoop?
I want to change the scheduling policy in Hadoop: how can I change the job order in MapReduce automatically? Assume you are using Hadoop 2 / YARN. The configuration parameter named yarn.resourcemanager.scheduler.class controls the class to be used as the resource scheduler for YARN/Hadoop. The default value for the scheduler class (check more at
Read more
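As a hedged sketch of the configuration parameter mentioned above, switching YARN to the Fair Scheduler (the choice of FairScheduler here is an illustration, not the post's prescription) could look like this in yarn-site.xml:

```xml
<!-- yarn-site.xml: the scheduler class used by the ResourceManager. -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
```

Restart the ResourceManager after changing the value so the new scheduler takes effect.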
How to manually kill HDFS DataNodes?
stop-dfs.sh reports that there are no datanodes running on some nodes, like: hdfs-node-000208: no datanode to stop However, there are DataNode processes running there. How to clean up these processes on many (100s of) nodes? You may use this piece of bash script: for i in `cat hadoop/etc/hadoop/slaves`; do echo $i; ssh $i 'jps | grep
Read more
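The excerpt's script is cut off; a hedged completion in the same spirit (assuming jps lists the process as DataNode and passwordless ssh to each slave works) might be:

```sh
# For each host in the slaves file, find the DataNode JVM via jps and kill it.
for i in $(cat hadoop/etc/hadoop/slaves); do
  echo "$i"
  ssh "$i" "jps | grep DataNode | awk '{print \$1}' | xargs -r kill"
done
```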
How to set the replication factor for one file when it is uploaded by `hdfs dfs -put` command line in HDFS?
When uploading a file by the hdfs dfs -put command line in HDFS, how to set a replication factor for that file instead of the global one? For example, HDFS’s global replication factor is 3. For some temporary files, I would like to save just one copy for faster uploading and to save disk space. The
Read more
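A small sketch of one way to do this, assuming the answer continues with Hadoop's generic -D option (localfile and the destination path are placeholders):

```sh
# Upload one file with replication factor 1, overriding the global default.
hdfs dfs -D dfs.replication=1 -put localfile /tmp/localfile
```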
How to handle missing blocks and blocks with corrupt replicas in HDFS?
On one HDFS cluster, hdfs dfsadmin -report reports: Under replicated blocks: 139016 Blocks with corrupt replicas: 9 Missing blocks: 0 The under-replicated blocks can be re-replicated automatically after some time. How to handle the missing blocks and blocks with corrupt replicas in HDFS? Understanding these blocks A block is “with corrupt replicas” in HDFS
Read more
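A hedged sketch of fsck commands commonly used to locate and then handle such blocks (whether to -move affected files to /lost+found or -delete them is a judgment call, not something the excerpt prescribes):

```sh
# Find which files have corrupt or missing blocks.
hdfs fsck / -list-corruptfileblocks
# Then either move the affected files to /lost+found or delete them.
hdfs fsck /path/to/bad/file -move    # or: hdfs fsck /path/to/bad/file -delete
```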
HDFS stays in safe mode because of reported blocks not reaching 0.9990 of total blocks
After a node failure and restarting HDFS, the NameNode reports: “The reported blocks 1968810 needs additional 5071 blocks to reach the threshold 0.9990 of total blocks 1975856. Safe mode will be turned off automatically.” in the log. Why does this happen? And how to fix it? About why the NameNode stays in the safe mode:
Read more
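A sketch of the related dfsadmin commands (forcing the NameNode out of safe mode is only sensible once you understand why blocks are missing; that caveat is mine, not the excerpt's):

```sh
# Check whether the NameNode is still in safe mode.
hdfs dfsadmin -safemode get
# Manually leave safe mode once enough DataNodes have reported their blocks.
hdfs dfsadmin -safemode leave
```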
How to skip the mapper function in Hadoop?
In Hadoop, I need to skip the mapper function and directly execute the reducer function. We are doing this to improve Hadoop performance: if the Hadoop framework is used to analyze the same data sets, the mapper’s output will be the same for different kinds of jobs. To save the redundant computation of the same results, I am planning to
Read more
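The excerpt stops before the plan is spelled out; one commonly suggested direction (my assumption, not the post's answer) is that the map phase cannot be removed outright, but the stock Mapper class is an identity function, so a job can be made effectively reduce-only:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

// A hedged sketch: identity map, real work in the reducer.
// conf and MyReducer are placeholders for the job's own configuration/class.
Job job = Job.getInstance(conf, "reduce-only-job");
job.setMapperClass(Mapper.class);      // the default Mapper emits input unchanged
job.setReducerClass(MyReducer.class);  // all computation happens here
```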
How to set replication factors for HDFS directories?
Is it possible to set the replication factor for a specific directory in HDFS to one that is different from the default replication factor? This should set the existing files’ replication factors and also apply to new files created in that directory. This can simplify administration. We can set the replication factor of /tmp/ to
Read more
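A sketch of the recursive command presumably being introduced when the excerpt cuts off (the factor 2 is an example value; note that -setrep changes existing files, while files created later still follow the client's dfs.replication setting):

```sh
# Recursively set replication factor 2 for everything under /tmp.
hdfs dfs -setrep -R 2 /tmp
```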
How to check the replication factor of a file in HDFS?
A related question: how to find the replication factors of files in an HDFS cluster? Method 1: You can use the HDFS command line to ls the file. The second column of the output shows the replication factor of the file. For example, $ hdfs dfs -ls /usr/GroupStorage/data1/out.txt -rw-r--r-- 3 hadoop zma 11906625598 2014-10-22
Read more
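Besides ls, a short sketch of another way to read the factor for a single file, using hdfs dfs -stat with the %r format (the path reuses the example file from above):

```sh
# Print just the replication factor of one file.
hdfs dfs -stat %r /usr/GroupStorage/data1/out.txt
```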
How to make the Hadoop Java process heap larger?
In Hadoop 2.5.0, I use 'ps -aux' and find the Java process has the option -Xmx1000m. However, my nodes have 32GB of memory. How to make the Hadoop Java process heap larger? In yarn-env.sh, you can find: # For setting YARN specific HEAP sizes please use this # Parameter and set appropriately # YARN_HEAPSIZE=1000 In hadoop-env.sh, you can
Read more
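A hedged sketch of where the excerpt appears to be heading (the 8000 MB figure is an arbitrary example for 32GB nodes; the values are in megabytes):

```sh
# In yarn-env.sh: heap size for the YARN daemons (ResourceManager/NodeManager).
export YARN_HEAPSIZE=8000
# In hadoop-env.sh: heap size for HDFS/common daemons such as the NameNode.
export HADOOP_HEAPSIZE=8000
```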
How to set the number of mappers and reducers of Hadoop in command line?
How to set the number of mappers and reducers of Hadoop in the command line? The numbers of mappers and reducers can be set in the command line like this (5 mappers, 2 reducers): -D mapred.map.tasks=5 -D mapred.reduce.tasks=2 In the code, one can configure the JobConf variables: job.setNumMapTasks(5); // 5 mappers job.setNumReduceTasks(2); // 2 reducers Note that on Hadoop
Read more
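A sketch of a full invocation using those -D options (myjob.jar, MyJobDriver and the paths are placeholders; this assumes the driver uses ToolRunner/GenericOptionsParser so the -D options are parsed):

```sh
# Request 5 mappers (only a hint to the framework) and 2 reducers for one run.
hadoop jar myjob.jar MyJobDriver \
  -D mapred.map.tasks=5 -D mapred.reduce.tasks=2 \
  /input/path /output/path
```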
How to set the data replication factor of Hadoop HDFS?
How to set the data replication factor of Hadoop HDFS in Hadoop 2 (YARN)? The default replication factor in HDFS is controlled by the dfs.replication property; the value is 3 by default. To change the replication factor, you can add a dfs.replication property setting to the hdfs-site.xml configuration file of Hadoop: <property> <name>dfs.replication</name> <value>1</value> <description>Replication
Read more
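One follow-up sketch (my addition, the excerpt is cut off before this point): dfs.replication in hdfs-site.xml only affects files created afterwards, so for files already in HDFS you would apply the factor explicitly:

```sh
# Apply replication factor 1 to existing files, recursively, waiting for completion.
hdfs dfs -setrep -R -w 1 /
```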
Hadoop 2 (YARN) default configuration values
Where to check the default Hadoop 2 (YARN) configuration values for HDFS (hdfs-site.xml), YARN (yarn-site.xml), and MapReduce (mapred-site.xml)? The default Hadoop 2 (YARN) configuration values for Hadoop 2.2.0 from the Apache Hadoop website: HDFS: http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml YARN: https://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml MapReduce: https://hadoop.apache.org/docs/r2.2.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
Good introductions to Hadoop 2.0 (YARN)?
Which ones are recommended introductions to Hadoop 2.0 (YARN)? Pointers to webpages are good. These are good ones that I found: The SoCC13 paper “Apache Hadoop YARN: Yet Another Resource Negotiator” by Vinod Kumar Vavilapalli et al.: http://www.socc2013.org/home/program/a5-vavilapalli.pdf The introduction from Hortonworks by Arun Murthy: http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/ The “official” one from the Apache Hadoop website (very brief): https://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/YARN.html
Classpath for compiling MapReduce jobs on Hadoop 2.2.0
How to get the correct classpath for compiling MapReduce jobs on Hadoop 2.2.0 (YARN)? The yarn command from Hadoop 2 can find it out for you: yarn classpath You may need to use the full path to yarn, which is under the bin directory of the Hadoop distribution package, if it is not in your $PATH.
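A sketch of using that output when compiling (MyMapReduce.java is a placeholder source file):

```sh
# Compile against the classpath that yarn reports for this installation.
javac -cp "$(yarn classpath)" MyMapReduce.java
```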
Hadoop Installation Tutorial (Hadoop 2.x)
Hadoop 2 or YARN is the new version of Hadoop. It adds the YARN resource manager in addition to the HDFS and MapReduce components. Hadoop MapReduce is a programming model and software framework for writing applications; it is an open-source variant of MapReduce, which was designed and implemented by Google initially for processing and generating large data
Read more