How to set the data replication factor of Hadoop HDFS?

Posted on

How to set the data replication factor of Hadoop HDFS in Hadoop 2 (YARN)? The default replication factor in HDFS is controlled by the dfs.replication property. The value is 3 by default. To change the replication factor, you can add a dfs.replication property settings in the hdfs-site.xml configuration file of Hadoop: <property> <name>dfs.replication</name> <value>1</value> <description>Replication
Read more

Hadoop 2 (YARN) default configuration values

Posted on

Where to check the default Hadoop 2 (YARN) configuration values for: HDFS: hdfs-site.xml YARN: yarn-site.xml MapReduce: mapred-site.xml Default Hadoop 2 (YARN) configuration values for Hadoop 2.2.0 from Apache Hadoop website: HDFS: http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml YARN: https://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml MapReduce: https://hadoop.apache.org/docs/r2.2.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

SQL layers on NoSQL databases

Posted on

What are the SQL layer solution over NoSQL databases such as key/value stores? Phoenix: A SQL layer on HBase: https://github.com/forcedotcom/phoenix They also show some performance results: https://github.com/forcedotcom/phoenix/wiki/Performance F1 – The Fault-Tolerant Distributed RDBMS Supporting Google’s Ad Business: http://research.google.com/pubs/pub38125.html With F1, we have built a novel hybrid system that combines the scalability, fault tolerance, transparent sharding,
Read more

How to force a metadata checkpointing in HDFS

Posted on

The metadata checkpointing in HDFS is done by the Secondary NameNode to merge the fsimage and the edits log files periodically and keep edits log size within a limit. For various reasons, the checkpointing by the Secondary NameNode may fail. For one example, HDFS SecondaraNameNode log shows errors in its log as follows. 2017-08-06 10:54:14,488
Read more

Hadoop Installation Tutorial (Hadoop 2.x)

Posted on

Hadoop 2 or YARN is the new version of Hadoop. It adds the yarn resource manager in addition to the HDFS and MapReduce components. Hadoop MapReduce is a programming model and software framework for writing applications, which is an open-source variant of MapReduce designed and implemented by Google initially for processing and generating large data
Read more

Hadoop MapReduce Tutorials

Posted on

Here is a list of tutorials for learning how to write MapReduce programs on Hadoop, the opensource MapReduce implementation with HDFS. MapReduce Tutorials The official tutorial on Hadoop MapReduce framework: http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html. Yahoo! Hadoop Tutorial A comprehensive tutorial on Hadoop from Yahoo! Developer Network: http://developer.yahoo.com/hadoop/tutorial/. More about MapReduce To better understand the design behind MapReduce, it
Read more

Large-scale Data Storage and Processing System in Datacenters

Posted on

Research on Cloud Computing has made big progresses and many excellent large-scale systems have been designed in recent years. I compiled a list of some large-scale data storage and processing systems in datacenters as follows. Storage systems Google File System (GFS): http://research.google.com/archive/gfs.html HDFS implementation: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html Colossus (GFS2): Colossus: Successor to the Google File System (GFS)
Read more

Hadoop Installation Tutorial (Hadoop 1.x)

Posted on

Update: If you are new to Hadoop and trying to install one. Please check the newer version: Hadoop Installation Tutorial (Hadoop 2.x). Hadoop mainly consists of two parts: Hadoop MapReduce and HDFS. Hadoop MapReduce is a programming model and software framework for writing applications, which is an open-source variant of MapReduce that is initially designed
Read more

A Simple Sort Benchmark on Hadoop

Posted on

After [[hadoop-installation-tutorial|installing Hadoop]], we usually run some benchmark programs to test whether the system works well. In the post of the Hadoop install tutorial, we show a very simple to grep strings from a simple sets of files. In this post, we introduce the Sort for testing and benchmarking Hadoop. The Sort program is also
Read more

mrcc – A Distributed C Compiler System on MapReduce

Posted on

The mrcc project’s homepage is here: mrcc project. Abstract mrcc is an open source compilation system that uses MapReduce to distribute C code compilation across the servers of the cloud computing platform. mrcc is built to use Hadoop by default, but it is easy to port it to other could computing platforms, such as MRlite,
Read more