A file in HDFS may be split into many blocks, with replicas stored on many DataNodes. How do you find the DataNodes that actually store a given file? You can use the fsck tool that ships with Hadoop's HDFS utilities. Here is an example:
$ hadoop fsck /user/aaa/file.name -files -locations -blocks
Tag: hadoop
How to write an /etc/fstab entry for --bind mounting?
How do you write an /etc/fstab entry for a --bind mount such as
mount --bind /home/hadoop/hdfs/store-tmp /home/store/tmp
From man 8 mount: Since Linux 2.4.0 it is possible to remount part of the file hierarchy somewhere else. The call is
mount --bind olddir newdir
or, with the short option,
mount -B olddir newdir
or, as an fstab entry:
/olddir /newdir none bind
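As a small sketch of the fstab form applied to the directories in the question, the helper below only prints the entry. The function name fstab_bind_entry is made up for illustration, and the trailing "0 0" (dump and fsck pass fields) is an assumption added here; the man page excerpt shows only the first four fields.

```shell
# Print an /etc/fstab line that bind-mounts one directory onto another.
# Fields: source dir, mount point, fs type "none", option "bind",
# then dump and fsck pass (both 0, an assumption for this sketch).
fstab_bind_entry() {
  printf '%s %s none bind 0 0\n' "$1" "$2"
}

fstab_bind_entry /home/hadoop/hdfs/store-tmp /home/store/tmp
```

The printed line can be appended to /etc/fstab by hand; running mount -a as root afterwards should mount it without a reboot.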
What’s the difference between Reliability, Durability, and Availability for a data storage system?
These are some important concepts in distributed systems such as the Hadoop Distributed File System, the Google File System and so on. Answer from http://www.quora.com/Whats-the-difference-between-Reliability-Durability-and-Availability-for-data-storage-system
The difference between durability and availability is fairly simple. Durability is about what happens when all power goes out everywhere. Has all data been written to stable storage that doesn’t require power (e.g. disk/flash), in…
How to change the number of replicas of certain files in HDFS?
HDFS has a configuration property in hdfs-site.xml, “dfs.replication”, that sets the global replication factor for blocks. However, some “hot” files are accessed by many nodes. How do you increase the number of replicas for just these files? You can set the replication factor of a certain file to 10: hdfs…
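The excerpt is cut off before its command, so the following is only a sketch of one common way to do this, not necessarily the author's. It uses the setrep subcommand of the HDFS shell; the function name and the example path are hypothetical.

```shell
# Set the replication factor of one HDFS file (or directory tree).
# $1: target replication factor, $2: HDFS path (hypothetical in the usage below)
# -w waits until the target replication is reached, which can take a while.
set_file_replication() {
  hdfs dfs -setrep -w "$1" "$2"
}

# Usage (needs a running HDFS):
#   set_file_replication 10 /user/aaa/hot.file
```

Dropping -w returns immediately and lets the NameNode schedule the extra replicas in the background.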
How to get logs of a specific time range on Linux?
The logs I am processing are Hadoop logs (log4j), in a format like:
2014-09-20 21:55:11,855 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated user map size: 36
2014-09-20 21:55:11,863 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated group map size: 55
2014-09-20 22:10:11,907 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Update cache now
2014-09-20 22:10:11,907 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Not doing static UID/GID mapping because '/etc/nfs.map' does not exist.
Now, I…
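One possible approach, sketched here under the assumption that every line of interest starts with a "date time" pair as in the sample above, is to compare timestamps as plain strings in awk. Timestamps in this format sort lexicographically in time order. The function name logs_between is made up for this sketch.

```shell
# Print log lines whose leading "YYYY-MM-DD HH:MM:SS,mmm" timestamp falls
# between two bounds, using string comparison.
# $1: lower bound, $2: upper bound, $3: log file
logs_between() {
  awk -v lo="$1" -v hi="$2" '
    { ts = $1 " " $2 }        # rebuild the timestamp from the first two fields
    ts >= lo && ts <= hi      # default action: print the matching line
  ' "$3"
}

# Usage:
#   logs_between "2014-09-20 21:00:00" "2014-09-20 22:00:00" hadoop.log
```

Continuation lines without a timestamp, such as Java stack traces, are compared by whatever their first two fields happen to be, so they may be dropped or kept unexpectedly.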
Making Hadoop Java process heap larger?
In Hadoop 2.5.0, I ran ‘ps -aux’ and found that the Java process has the option -Xmx1000m. However, my nodes have 32GB of memory. How can I make the Hadoop Java process heap larger? In yarn-env.sh, you can find:
# For setting YARN specific HEAP sizes please use this
# Parameter and set appropriately
# YARN_HEAPSIZE=1000
In hadoop-env.sh, you can…
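The excerpt stops before showing the change itself, so this is a hedged sketch rather than the author's exact settings. In Hadoop 2.x both variables are given in MB; the value 4096 is an arbitrary example for a 32GB node, not a recommendation.

```shell
# In yarn-env.sh: heap size for the YARN daemons, in MB (example value).
export YARN_HEAPSIZE=4096

# In hadoop-env.sh: maximum heap for the Hadoop daemons, in MB (example value).
export HADOOP_HEAPSIZE=4096
```

After restarting the daemons, ps should show -Xmx4096m in place of -Xmx1000m on the affected Java processes.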
How to set the number of mappers and reducers of Hadoop in command line?
How do you set the number of mappers and reducers of Hadoop on the command line? The number of mappers and reducers can be set on the command line like this (5 mappers, 2 reducers):
-D mapred.map.tasks=5 -D mapred.reduce.tasks=2
In the code, one can configure JobConf variables:
job.setNumMapTasks(5); // 5 mappers
job.setNumReduceTasks(2); // 2 reducers
Note that on Hadoop…
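To show where the -D options sit in a full invocation, here is a minimal sketch. The wrapper name, jar, class and paths are hypothetical. It assumes the job's driver goes through ToolRunner, since the generic -D options are only parsed in that case, and they must come before the job's own arguments.

```shell
# Run a MapReduce job with explicit mapper and reducer counts.
# $1: job jar, $2: main class, $3: mappers, $4: reducers, rest: job arguments
run_with_task_counts() {
  jar="$1"; main="$2"; maps="$3"; reduces="$4"
  shift 4
  hadoop jar "$jar" "$main" \
    -D mapred.map.tasks="$maps" \
    -D mapred.reduce.tasks="$reduces" \
    "$@"
}

# Usage (hypothetical jar, class and paths):
#   run_with_task_counts wordcount.jar WordCount 5 2 /input /output
```

On Hadoop 2 the current property names are mapreduce.job.maps and mapreduce.job.reduces; the mapred.* names are deprecated aliases. The mapper count is only a hint to the framework, while the reducer count is honored as given.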
How to set the data replication factor of Hadoop HDFS?
How do you set the data replication factor of HDFS in Hadoop 2 (YARN)? The default replication factor in HDFS is controlled by the dfs.replication property, whose value is 3 by default. To change the replication factor, add a dfs.replication property setting to Hadoop's hdfs-site.xml configuration file:
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Replication…
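As a complement to editing hdfs-site.xml, the sketch below shows two related commands: one reads back the value the client configuration resolves to, the other overrides the factor for a single upload. The function names and example paths are made up for illustration.

```shell
# Print the dfs.replication value from the effective client configuration.
show_default_replication() {
  hdfs getconf -confKey dfs.replication
}

# Upload one file with a replication factor different from the default.
# $1: replication factor, $2: local file, $3: HDFS destination
put_with_replication() {
  hdfs dfs -D dfs.replication="$1" -put "$2" "$3"
}

# Usage (needs a running HDFS; paths are hypothetical):
#   show_default_replication
#   put_with_replication 2 data.csv /user/aaa/data.csv
```

Changing dfs.replication only affects files written afterwards; existing files keep the factor they were created with.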
Hadoop 2 (YARN) default configuration values
Where can you check the default Hadoop 2 (YARN) configuration values for:
HDFS: hdfs-site.xml
YARN: yarn-site.xml
MapReduce: mapred-site.xml
The default configuration values for Hadoop 2.2.0 are on the Apache Hadoop website:
HDFS: http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
YARN: https://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
MapReduce: https://hadoop.apache.org/docs/r2.2.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
Good introductions to Hadoop 2.0 (YARN)?
Which introductions to Hadoop 2.0 (YARN) are recommended? Pointers to webpages are welcome. These are the good ones I have found:
The SoCC13 paper “Apache Hadoop YARN: Yet Another Resource Negotiator” by Vinod Kumar Vavilapalli et al.: http://www.socc2013.org/home/program/a5-vavilapalli.pdf
The introduction from Hortonworks by Arun Murthy: http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/
The “official” one from the Apache Hadoop website (very brief): https://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/YARN.html
Classpath for compiling MapReduce jobs on Hadoop 2.2.0
How do you get the correct classpath for compiling MapReduce jobs on Hadoop 2.2.0 (YARN)? The yarn command from Hadoop 2 can find it for you:
yarn classpath
If yarn is not in your $PATH, use its full path; it is under the bin directory of the Hadoop distribution package.
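A minimal sketch of feeding that classpath to the compiler follows. The function name, the source file and the output directory are hypothetical; it assumes both yarn and javac are in $PATH.

```shell
# Compile a MapReduce source file against the classpath reported by yarn.
# $1: Java source file, $2: output directory for the .class files
compile_job() {
  mkdir -p "$2"
  javac -cp "$(yarn classpath)" -d "$2" "$1"
}

# Usage (hypothetical file names):
#   compile_job WordCount.java classes
#   jar -cvf wordcount.jar -C classes .
```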
How to choose the number of mappers and reducers in Hadoop
How do you choose the number of mappers and reducers in Hadoop to get good job performance? The Hadoop Wiki discusses this: http://wiki.apache.org/hadoop/HowManyMapsAndReduces
Some valuable points:
About the number of maps: The number of maps is usually driven by the number of DFS blocks in the input files. Although that causes people to…
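The "driven by the number of DFS blocks" point can be turned into a rough estimate: for a single input file, the map count is about the file size divided by the block size, rounded up. This is a sketch under that one-split-per-block assumption; the function name is made up, and input formats or split settings can change the real number.

```shell
# Estimate the number of map tasks for one input file as ceil(size / block size).
# $1: file size in bytes, $2: HDFS block size in bytes
estimate_maps() {
  size_bytes="$1"; block_bytes="$2"
  echo $(( (size_bytes + block_bytes - 1) / block_bytes ))
}

# A 10 GiB file with 128 MiB blocks:
estimate_maps 10737418240 134217728
```

With many small input files the estimate has to be applied per file and summed, since each file gets at least one split.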
SQL layers on NoSQL databases
What are the SQL layer solutions over NoSQL databases such as key/value stores?
Phoenix: A SQL layer on HBase: https://github.com/forcedotcom/phoenix
They also show some performance results: https://github.com/forcedotcom/phoenix/wiki/Performance
F1 – The Fault-Tolerant Distributed RDBMS Supporting Google’s Ad Business: http://research.google.com/pubs/pub38125.html
With F1, we have built a novel hybrid system that combines the scalability, fault tolerance, transparent sharding, …
How to force a metadata checkpoint in HDFS
Metadata checkpointing in HDFS is done by the Secondary NameNode, which periodically merges the fsimage and the edits log files to keep the edits log size within a limit. For various reasons, checkpointing by the Secondary NameNode may fail. For example, the HDFS SecondaryNameNode log shows errors as follows.
2017-08-06 10:54:14,488…
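The excerpt ends before the post's own fix, so what follows is only one known way to force a checkpoint and may differ from the author's. It asks the NameNode itself to save its namespace, which requires safe mode and HDFS superuser rights. The function name is hypothetical.

```shell
# Force the NameNode to write a fresh fsimage and reset the edits log.
# HDFS is read-only while safe mode is on, so keep the window short.
force_checkpoint() {
  hdfs dfsadmin -safemode enter &&
  hdfs dfsadmin -saveNamespace &&
  hdfs dfsadmin -safemode leave
}

# Usage (run as the HDFS superuser on a live cluster):
#   force_checkpoint
```

If saveNamespace fails, the chain stops and the cluster stays in safe mode until hdfs dfsadmin -safemode leave is run by hand.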
Hadoop Installation Tutorial (Hadoop 2.x)
Hadoop 2, or YARN, is the new version of Hadoop. It adds the YARN resource manager to the HDFS and MapReduce components. Hadoop MapReduce is a programming model and software framework for writing applications; it is an open-source variant of MapReduce, which was originally designed and implemented by Google for processing and generating large data…
Big Data Benchmark from AMPLab of UC Berkeley
Benchmarks are important for understanding performance and for comparing different systems quantitatively and qualitatively. Many analytic frameworks, such as Hive, Impala and Shark, have been designed and implemented in recent years and have become fundamental software for processing big data. How to benchmark these big data analytic systems is an interesting problem.
The Big Data Benchmark
The…
Hadoop MapReduce Tutorials
Here is a list of tutorials for learning how to write MapReduce programs on Hadoop, the open-source MapReduce implementation with HDFS.
MapReduce Tutorials
The official tutorial on the Hadoop MapReduce framework: http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
Yahoo! Hadoop Tutorial
A comprehensive tutorial on Hadoop from Yahoo! Developer Network: http://developer.yahoo.com/hadoop/tutorial/
More about MapReduce
To better understand the design behind MapReduce, it…
PUMA: A MapReduce Benchmark Suite
MapReduce is a well-known programming model designed for generating and processing large data sets. There are various MapReduce implementations, of which Hadoop may be the most widely known and used. Benchmarking MapReduce frameworks is becoming important. Faraz Ahmad et al. developed a benchmark suite, the PUMA MapReduce Benchmark:
During our work on MapReduce, we developed a benchmark suite…
Hadoop TeraSort Benchmark
TeraSort is one of Hadoop’s widely used benchmarks. Hadoop’s distribution contains both the input generator and the sorting implementation: TeraGen generates the input and TeraSort conducts the sorting. Here, we provide a short tutorial for using the Hadoop TeraSort benchmark. TeraGen generates random data that can be used as input data for a subsequent running…
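The full tutorial is behind the truncated excerpt, so the following is just a hedged sketch of the usual three-step run: generate, sort, validate. The function name and directory layout are hypothetical, and the location of the MapReduce examples jar is passed in because it varies by Hadoop version and install. TeraGen takes a row count, and each row is 100 bytes.

```shell
# Run TeraGen, TeraSort and TeraValidate in sequence.
# $1: path to the MapReduce examples jar (location depends on the install)
# $2: number of 100-byte rows to generate
# $3: HDFS base directory for input, output and the validation report
run_terasort() {
  jar="$1"; rows="$2"; base="$3"
  hadoop jar "$jar" teragen "$rows" "$base/input" &&
  hadoop jar "$jar" terasort "$base/input" "$base/output" &&
  hadoop jar "$jar" teravalidate "$base/output" "$base/report"
}

# Usage (hypothetical paths; 10000000 rows is about 1 GB of input):
#   run_terasort /path/to/hadoop-mapreduce-examples.jar 10000000 /benchmarks/terasort
```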
Large-scale Data Storage and Processing Systems in Datacenters
Research on cloud computing has made great progress, and many excellent large-scale systems have been designed in recent years. I compiled a list of some large-scale data storage and processing systems in datacenters as follows.
Storage systems
Google File System (GFS): http://research.google.com/archive/gfs.html
HDFS implementation: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
Colossus (GFS2): Colossus: Successor to the Google File System (GFS)