Hadoop TeraSort Benchmark

By Eric Ma | Dec 18, 2012 | Sep 5, 2020

TeraSort is one of the most widely used Hadoop benchmarks. The Hadoop distribution includes both the input generator and the sorting implementation: TeraGen generates the input and TeraSort sorts it. This short tutorial shows how to use the Hadoop TeraSort benchmark.

TeraGen generates random data that can be used as input for a subsequent TeraSort run.

Generate input by TeraGen

The syntax for TeraGen:

$ hadoop jar hadoop-*examples*.jar teragen \
<number of 100-byte rows> <output dir>
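
Because every row is 100 bytes, the row count follows directly from the amount of data you want to sort. The following is a minimal sketch of that arithmetic; the helper name `rows_for_bytes` and the decimal size units (1 GB = 10^9 bytes) are assumptions made for illustration, not part of Hadoop.

```shell
#!/bin/sh
# Convert a target data size into the row count TeraGen expects.
# Each TeraGen row is 100 bytes, so rows = bytes / 100.

ROW_BYTES=100

# rows_for_bytes is a hypothetical helper, not a Hadoop command.
rows_for_bytes() {
    # $1: target size in bytes
    echo $(( $1 / ROW_BYTES ))
}

# Decimal units are assumed here: 1 GB = 10^9 bytes, 1 TB = 10^12 bytes.
ROWS_1GB=$(rows_for_bytes 1000000000)
ROWS_1TB=$(rows_for_bytes 1000000000000)

echo "1 GB -> $ROWS_1GB rows"
echo "1 TB -> $ROWS_1TB rows"
```

For example, generating about 1 TB of input means passing 10000000000 as the row count to TeraGen.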

To make TeraGen run on multiple nodes with multiple tasks, you may need to specify the number of map tasks (30 here as an example; for Hadoop 2). The -D option is a generic option, so it goes after the program name and uses the property=value form:

$ hadoop jar hadoop-*examples*.jar teragen \
-Dmapreduce.job.maps=30 \
<number of 100-byte rows> <output dir>

The right number of mappers depends on the number of rows you generate and the number of nodes you have. For more information on how to set the number of mappers and reducers, see https://www.systutorials.com/qa/947/how-set-the-number-mappers-and-reducers-hadoop-command-line

Run TeraSort

After the data is generated, run the sort with TeraSort:

$ hadoop jar hadoop-*examples*.jar terasort \
<input dir> <output dir>

You may also need to set the number of mappers and reducers for better performance.

Validate the sorted output data of TeraSort

TeraValidate ensures that the output data of TeraSort is globally sorted.

The syntax for TeraValidate:

$ hadoop jar hadoop-*examples*.jar teravalidate \
<terasort output dir> <terasort-validate dir>
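
The three steps can be chained into one script. The sketch below is only an illustration under stated assumptions: the jar name, the HDFS directory `/benchmarks/terasort`, the task counts, and the `run`/`benchmark` helpers are all hypothetical, and the script only prints the commands unless `DRY_RUN` is set to 0.

```shell
#!/bin/sh
# Sketch: run TeraGen, TeraSort and TeraValidate in sequence.
# All values below are hypothetical; adjust them for your cluster.

EXAMPLES_JAR=${EXAMPLES_JAR:-hadoop-mapreduce-examples.jar}  # assumed jar name
ROWS=${ROWS:-10000000}              # 10^7 rows * 100 bytes = 1 GB
MAPS=${MAPS:-30}                    # map tasks for TeraGen
REDUCES=${REDUCES:-30}              # reduce tasks for TeraSort
BASE=${BASE:-/benchmarks/terasort}  # hypothetical HDFS directory
DRY_RUN=${DRY_RUN:-1}               # 1 = only print the commands

run() {
    # Print the command; execute it only when DRY_RUN is 0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

benchmark() {
    # Each step runs only if the previous one succeeded.
    run hadoop jar "$EXAMPLES_JAR" teragen \
        -Dmapreduce.job.maps="$MAPS" "$ROWS" "$BASE/input" &&
    run hadoop jar "$EXAMPLES_JAR" terasort \
        -Dmapreduce.job.reduces="$REDUCES" "$BASE/input" "$BASE/output" &&
    run hadoop jar "$EXAMPLES_JAR" teravalidate \
        "$BASE/output" "$BASE/validate"
}

benchmark
```

Run it once as is to review the printed commands, then run it with `DRY_RUN=0` to execute them. Hadoop jobs generally refuse to write to an output directory that already exists, so use a fresh `BASE` directory for each run or remove the old one first.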

One Comment

Eric Zhiqiang Ma says:

Jul 23, 2014 at 6:34 pm

For large datasets, you may need to specify the number of mappers and reducers to make the computation and data distributed across nodes:

https://www.systutorials.com/qa/947/how-set-the-number-mappers-and-reducers-hadoop-command-line
