Cluster

Linux & Systems Administration

Setting Custom Replication Factor for Individual HDFS Files
By Eric Ma Mar 24, 2018 Apr 13, 2026

When uploading files to HDFS using hdfs dfs -put, the replication factor defaults to the cluster-wide setting in hdfs-site.xml (typically 3). For temporary files, logs, or staging data, you often want a lower replication factor to reduce write latency and disk usage. Override Replication Factor at Upload Time: Use the -D flag to pass HDFS…
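A minimal sketch of the upload-time override mentioned above, assuming the generic -D dfs.replication=<n> option is what the truncated excerpt goes on to describe. The function name put_with_replication, the HDFS_CMD override (used only so the command can be dry-run), and the sample paths are hypothetical and do not come from the original article.

```shell
#!/bin/sh
# Hypothetical helper: upload one file with a per-file replication factor.
# HDFS_CMD is an illustrative override so the command can be dry-run with echo.
put_with_replication() {
    rep="$1"; src="$2"; dst="$3"
    # Reject empty or non-numeric input and a literal 0.
    case "$rep" in
        ''|*[!0-9]*|0)
            echo "replication must be a positive integer" >&2
            return 1
            ;;
    esac
    # Generic options such as -D go after "dfs" and before the -put subcommand.
    "${HDFS_CMD:-hdfs}" dfs -D dfs.replication="$rep" -put "$src" "$dst"
}
```

Running it as HDFS_CMD=echo put_with_replication 1 /tmp/a.log /staging/a.log prints the command instead of executing it, which is a cheap way to check the option order before touching a real cluster.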

Systems & Architecture

Diagnosing and Repairing Corrupt HDFS Blocks
By Eric Ma Mar 24, 2018 Apr 11, 2026

When you see output like this from hdfs dfsadmin -report, you need to understand what each metric means and how to respond: Under replicated blocks: 139016; Blocks with corrupt replicas: 9; Missing blocks: 0. Understanding block states in HDFS is essential for cluster health. The distinction between these categories determines how urgent your response needs…
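As an illustration of reading those three counters, here is a small sketch that pulls them out of the report text. The function name block_health and the key=value output format are assumptions made for this example; only the three label strings come from the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper, not an HDFS tool: reads `hdfs dfsadmin -report` text on
# stdin and prints the three block counters quoted above as key=value pairs.
block_health() {
    awk -F': *' '
        /Under replicated blocks:/      { under = $2 }
        /Blocks with corrupt replicas:/ { corrupt = $2 }
        /Missing blocks:/               { missing = $2 }
        # A counter that never appears in the input is printed as 0.
        END { printf "under=%d corrupt=%d missing=%d\n", under, corrupt, missing }
    '
}
```

On a live cluster the usage would be hdfs dfsadmin -report | block_health, which makes the counters easy to feed into a monitoring check or a cron job.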

Programming Languages

Recovering HDFS from Safe Mode After DataNode Failures
By Eric Ma Mar 24, 2018 Apr 12, 2026

When a NameNode restarts, it enters safe mode to rebuild its understanding of block-to-DataNode mappings. If DataNodes don’t report their blocks quickly enough or some don’t come back online, the NameNode may stay stuck in safe mode with a message like: The reported blocks 1968810 needs additional 5071 blocks to reach the threshold 0.9990 of…

Linux & Systems Administration

Linux Server Boot Notifications via Email
By Eric Ma Mar 24, 2018 Apr 13, 2026

If you manage a cluster of servers, you’ll want to be notified when they come back online after restarts or unexpected reboots. Email notifications on boot let you track infrastructure changes and catch potential issues early. Using crontab @reboot: The simplest approach is a crontab @reboot entry that runs once at startup. This executes before most services…

Programming Languages

Configuring HDFS Replication Factors by Directory
By Eric Ma Mar 24, 2018 Apr 12, 2026

HDFS doesn’t natively support directory-level replication factor inheritance. Even if you set a specific replication factor on a directory and its files, new files created in that directory will default to the cluster’s global dfs.replication setting (typically 3). This limitation can complicate multi-tier storage strategies where you want temporary or low-priority data on fewer replicas…

Programming Languages

Distributed Consensus Algorithms: Paxos vs. Raft
By Eric Ma Mar 24, 2018 Apr 12, 2026

Distributed systems need consensus: a way for multiple nodes to agree on state despite failures and network partitions. Understanding the key algorithms behind consensus is essential for anyone designing reliable systems. Paxos and Its Variants: Paxos is the foundational consensus algorithm, proven safe under crash failures and message loss (classic Paxos does not tolerate Byzantine failures). It works in three phases: prepare, accept, and…

Languages & Frameworks

Adding a Secondary NameNode Metadata Directory to HDFS
By Eric Ma Mar 24, 2018 Apr 12, 2026

Adding a second metadata directory to your HDFS NameNode increases reliability by maintaining synchronized replicas of the namespace and transaction logs across separate disks. This guide walks through the process safely. Prerequisites and Planning: Before starting, verify your current configuration and plan the new directory location: grep -A2 "dfs.namenode.name.dir" $HADOOP_HOME/etc/hadoop/hdfs-site.xml The new directory should be…

Linux & Systems Administration

Checking HDFS File Replication Factor
By Eric Ma Mar 24, 2018 Apr 13, 2026

When managing HDFS clusters, you often need to verify the replication factor of specific files to ensure data redundancy meets your requirements. Here are the practical methods to check this. Using hdfs dfs -ls: The most straightforward way is to list the file with hdfs dfs -ls: hdfs dfs -ls /usr/GroupStorage/data1/out.txt Output: -rw-r--r-- 3 hadoop…
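The listing approach can be sketched as a small filter that reads hdfs dfs -ls output and prints the replication column. The function name replication_from_ls is hypothetical, and the group, size, and timestamp fields in the usage example below are invented for illustration, since the excerpt's sample output is truncated after the owner.

```shell
#!/bin/sh
# Hypothetical helper: reads `hdfs dfs -ls` lines on stdin and prints
# "<replication> <path>" for each regular file.
# For files, the second column is the replication factor; directories show "-"
# there and are skipped, as is the "Found N items" header line.
replication_from_ls() {
    # $NF is the last whitespace-separated field, so paths that contain
    # spaces are not handled by this sketch.
    awk '$1 ~ /^-/ { print $2, $NF }'
}
```

For example, hdfs dfs -ls /usr/GroupStorage/data1 | replication_from_ls would print one line per file. For a single file, hdfs dfs -stat %r <path> prints the replication factor directly and avoids parsing altogether.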

Linux & Systems Administration

Adjusting HDFS Replication Factor on Live Clusters
By Eric Ma Mar 24, 2018 Apr 13, 2026

When you need to increase data redundancy or reduce storage overhead on a running HDFS cluster, you’ll often need to adjust the replication factor. Before making changes, understand that HDFS replication works differently than you might expect. How HDFS Replication Factor Works: The replication factor in HDFS is determined at write time by the client,…

Linux & Systems Administration

HDFS Snapshots: Architecture and Implementation Details
By Eric Ma Mar 24, 2018 Apr 11, 2026

HDFS snapshots provide a way to create read-only point-in-time copies of the filesystem or specific directories without duplicating data. Understanding their design helps you implement efficient backup strategies and recover from accidental deletions in production clusters. How HDFS Snapshots Work: Snapshots in HDFS use copy-on-write semantics to avoid duplicating data. When you create a snapshot,…

Linux & Systems Administration

Balancing HDFS DataNode Storage
By Eric Ma Mar 24, 2018 Apr 13, 2026

As nodes are added to or removed from a Hadoop cluster, storage utilization becomes uneven across DataNodes. Some fill up while others remain mostly empty, leading to inefficient resource use and potential storage bottlenecks. HDFS provides the Balancer tool to redistribute blocks across DataNodes and even out disk usage. Understanding the Balancer: The HDFS Balancer is…

Linux & Systems Administration

Adjusting HDFS Replication Factor Per File
By Eric Ma Mar 24, 2018 Apr 13, 2026

HDFS uses the dfs.replication property in hdfs-site.xml to set a global default replication factor for newly written files. However, you can override this on a per-file or per-directory basis using the hdfs dfs -setrep command, which is useful for frequently accessed “hot” files that need higher availability. Basic syntax: hdfs dfs -setrep [-R] [-w] <numReplicas> <path> Setting…
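The syntax above can be wrapped in a small helper, sketched here under stated assumptions: the function name set_replication, the literal "wait" argument, and the HDFS_CMD dry-run override are all hypothetical. The -w flag asks the command to block until the target replication is reached, which can take a long time on large paths.

```shell
#!/bin/sh
# Hypothetical wrapper around `hdfs dfs -setrep`.
# Usage: set_replication <numReplicas> <path> [wait]
# HDFS_CMD is an illustrative override so the command can be dry-run with echo.
set_replication() {
    rep="$1"; path="$2"; wait_flag="${3:-}"
    # Reject empty or non-numeric input and a literal 0.
    case "$rep" in
        ''|*[!0-9]*|0)
            echo "replication must be a positive integer" >&2
            return 1
            ;;
    esac
    if [ "$wait_flag" = "wait" ]; then
        # -w blocks until the requested replication is reached.
        "${HDFS_CMD:-hdfs}" dfs -setrep -w "$rep" "$path"
    else
        "${HDFS_CMD:-hdfs}" dfs -setrep "$rep" "$path"
    fi
}
```

When the path is a directory, -setrep changes every file under it, so the same helper covers the per-directory case; files created there later still pick up the cluster default.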

Linux & Systems Administration

High Availability in Distributed Storage: RDA Fundamentals
By Weiwei Jia Mar 24, 2018 Apr 11, 2026

Understanding the difference between reliability, durability, and availability is critical when designing or operating distributed storage systems like HDFS, Ceph, or cloud object storage. These terms are often confused, but they address distinct concerns. Durability: Durability answers the question: if the system fails completely, will my data survive? Durability is about persistence to non-volatile storage. Data is…

Linux & Systems Administration

Configuring Heap Size for Hadoop NameNode, DataNode, and YARN
By Eric Ma Mar 24, 2018 Apr 13, 2026

When running Hadoop on systems with substantial memory, the default 1GB heap size is often inadequate. If you check running processes with ps aux and see -Xmx1000m, you’re working with the default configuration that doesn’t scale to modern hardware. Understanding Hadoop Heap Configuration: Hadoop’s Java process memory is controlled by environment variables set in configuration…
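The ps aux check described above can be sketched as a filter that extracts the heap flag from a process listing. The function name heap_flags is hypothetical, and the sketch relies on grep -o, which GNU and BSD grep provide but POSIX does not require.

```shell
#!/bin/sh
# Hypothetical helper: reads process listing lines on stdin (for example from
# `ps aux`) and prints every -Xmx setting found, one per line.
heap_flags() {
    # "--" ends option parsing so the pattern's leading "-" is not read as a flag.
    grep -o -- '-Xmx[0-9][0-9]*[kKmMgG]\{0,1\}'
}
```

A possible invocation is ps aux | grep '[N]ameNode' | heap_flags, where the bracketed first letter keeps the grep process itself out of the results. Seeing -Xmx1000m in the output suggests the daemon is still running with the default heap.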

Linux & Systems Administration

Configuring Mappers and Reducers in Hadoop: CLI and Code Approaches
By Eric Ma Mar 24, 2018 Apr 13, 2026

To set the number of mappers and reducers when submitting a Hadoop job, use the -D flag with the appropriate property names. The correct properties depend on your Hadoop version. Hadoop 2.x and Later (YARN): Use the modern property names, placing the -D options after the JAR (and main class, if given) so that the job’s generic option parsing picks them up: hadoop jar yourapp.jar -Dmapreduce.job.maps=5 -Dmapreduce.job.reduces=2 The older mapred.map.tasks and mapred.reduce.tasks properties are deprecated in Hadoop…

Scripting & Utilities

Configuring HDFS Replication: Cluster and Per-File Settings
By Eric Ma Mar 24, 2018 Apr 12, 2026

The replication factor determines how many copies of each data block HDFS maintains across your cluster. The default is 3, which provides adequate fault tolerance for most deployments: data survives the loss of a single node or, when rack awareness is configured, a single rack. Setting Cluster-Wide Default Replication: To set the default replication factor for all new files, add the dfs.replication property to…

Design Patterns & Architecture

Understanding Hadoop Configuration Files: Locations and Defaults
By Eric Ma Mar 24, 2018 Apr 13, 2026

Hadoop uses three primary configuration files to define YARN, HDFS, and MapReduce behavior: hdfs-site.xml (HDFS), yarn-site.xml (YARN), and mapred-site.xml (MapReduce). These files live in $HADOOP_HOME/etc/hadoop/ and override the built-in defaults when present. Finding Official Default Values: Apache publishes default configuration documentation for each release. For current versions: Hadoop 3.4.x (Latest): HDFS defaults: https://hadoop.apache.org/docs/r3.4.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml; YARN defaults: https://hadoop.apache.org/docs/r3.4.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml…

Development Best Practices

Understanding YARN: Resource Management and Cluster Fundamentals
By Eric Ma Mar 24, 2018 Apr 12, 2026

YARN (Yet Another Resource Negotiator) fundamentally restructured Hadoop 2.0 by decoupling resource management from application logic. If you’re transitioning from Hadoop 1.x or building systems on top of YARN, understanding its architecture is essential for effective cluster administration and application development. Essential Reading: The foundational paper: Start with “Apache Hadoop YARN: Yet Another Resource Negotiator”…

Design Patterns & Architecture

Configuring Hadoop Classpath for MapReduce Compilation
By Eric Ma Mar 24, 2018 Apr 13, 2026

When compiling MapReduce jobs against a Hadoop installation, you need to include the correct classpath to resolve Hadoop dependencies. The yarn classpath command handles this automatically. Getting the classpath: Run this command to output the full classpath: yarn classpath If yarn isn’t in your $PATH, use the full path: $HADOOP_HOME/bin/yarn classpath Replace $HADOOP_HOME with your…

Development Best Practices

Safely Disabling FirewallD on Fedora
By Eric Ma Mar 24, 2018 Apr 13, 2026

Fedora uses FirewallD as its default firewall management service. If you’re running servers in a trusted internal cluster or isolated environment where network-level filtering isn’t needed, you can disable it entirely. Stop and Disable FirewallD: To completely disable the firewall, run sudo systemctl stop firewalld followed by sudo systemctl disable firewalld. The stop command halts the service immediately….
