Author: Eric Ma

Eric is a systems guy. Eric is interested in building high-performance and scalable distributed systems and related technologies. The views or opinions expressed here are solely Eric's own and do not necessarily represent those of any third parties.

Web

Moved back to WordPress from MediaWiki
By Eric Ma | Jul 12, 2013 | Sep 5, 2020

We missed WordPress for its many great features and plugins, so we moved the site back to WordPress on Jul. 12, 2013. MediaWiki is great but, for this site, WordPress is a better solution. The most-missed features of WordPress: Related posts via the YARPP plugin. URLs that are not strongly mapped to the title…

Linux | Tutorial

How to Run a cron Job Every Two Weeks / Months / Days
By Eric Ma | Mar 25, 2013 | Dec 29, 2019

We may want to run some jobs every two weeks/months/days… in some situations, such as backing up every other week. In addition, we may add more complex rules for running jobs, e.g. running a command when the load of the server is higher than a certain level. With the help of the shell…
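
A minimal sketch of one way to get an every-other-week schedule, assuming the job is started by a weekly cron entry and decides for itself whether to proceed. It is written in Python rather than the shell the post refers to, and the function name `is_even_week` is hypothetical. It checks the parity of the number of whole weeks elapsed since the Unix epoch.

```python
import time

SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604800 seconds


def is_even_week(timestamp=None):
    """Return True when the count of whole weeks since the Unix epoch is even.

    Hypothetical helper: `timestamp` is seconds since the epoch and
    defaults to the current time.
    """
    if timestamp is None:
        timestamp = time.time()
    return (int(timestamp) // SECONDS_PER_WEEK) % 2 == 0


if __name__ == "__main__":
    # Meant to be launched by a weekly cron entry; the work is done
    # only on every other invocation.
    if is_even_week():
        print("running the biweekly job")
    else:
        print("skipping this week")
```

A weekly crontab entry could call this script and let it skip the odd weeks. Epoch weeks roll over on Thursday at 00:00 UTC, so a schedule that fires close to that moment may not alternate cleanly.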

Linux | Tutorial

How to Compress/Uncompress Files in Linux Using gzip, bzip2, 7z, rar and zip
By Eric Ma | Mar 25, 2013 | Aug 23, 2020

Compressing and uncompressing files are frequent operations. The common tools for compressing/uncompressing in Linux are gzip, bzip2, 7z, rar and zip. This post introduces how to compress and uncompress files in Linux using these tools. We use the best compression rate with all these tools and mark the options for “best rate” in bold. We can delete…
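
As a rough illustration of the “best rate” idea, here is a sketch that uses Python's standard-library `gzip` and `bz2` modules instead of the command-line tools the post covers. The function name `compress_best` is hypothetical, and 7z, rar and zip are left out. Compression level 9 is the highest level both modules accept, corresponding to `gzip -9` and `bzip2 -9`.

```python
import bz2
import gzip


def compress_best(data: bytes) -> dict:
    """Compress `data` with gzip and bzip2 at their highest level (9).

    Hypothetical helper for illustration; returns the compressed bytes
    keyed by tool name.
    """
    return {
        "gzip": gzip.compress(data, compresslevel=9),
        "bzip2": bz2.compress(data, compresslevel=9),
    }


def decompress(name: str, blob: bytes) -> bytes:
    """Reverse compress_best for one entry."""
    if name == "gzip":
        return gzip.decompress(blob)
    if name == "bzip2":
        return bz2.decompress(blob)
    raise ValueError(f"unknown format: {name}")


if __name__ == "__main__":
    sample = b"hello compression " * 1000
    for name, blob in compress_best(sample).items():
        # Round-trip check: the decompressed bytes must match the input.
        assert decompress(name, blob) == sample
        print(name, len(sample), "->", len(blob), "bytes")
```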

Insights | Systems

Storage Architecture and Challenges by Andrew Fikes at Google Faculty Summit 2010
By Eric Ma | Jan 22, 2013 | Aug 30, 2020

Storage Architecture and Challenges at the Faculty Summit, July 29, 2010, by Andrew Fikes, Principal Engineer. Download PDF (from archive.org). These slides introduce some of Google’s storage systems, with insights and discussion of problems.

Insights | Systems

Designs, Lessons and Advice from Building Large Distributed Systems
By Eric Ma | Jan 22, 2013 | Aug 30, 2020

Designs, Lessons and Advice from Building Large Distributed Systems by Jeff Dean. Everyone who is interested in large distributed systems should read it: PDF for Designs, Lessons and Advice from Building Large Distributed Systems by Jeff Dean.

Computing systems | News

PUMA: A MapReduce Benchmark Suite
By Eric Ma | Dec 20, 2012 | Sep 5, 2020

MapReduce is a well-known programming model designed for generating and processing large data sets. There are various MapReduce implementations; a widely known and used one is Hadoop. Benchmarking MapReduce frameworks has become important. Faraz Ahmad et al. developed a benchmark suite: PUMA MapReduce Benchmark. During our work on MapReduce, we developed a benchmark suite…

Tutorial

Hadoop TeraSort Benchmark
By Eric Ma | Dec 18, 2012 | Sep 5, 2020

TeraSort is one of Hadoop’s widely used benchmarks. Hadoop’s distribution contains both the input generator and the sorting implementation: TeraGen generates the input and TeraSort conducts the sorting. Here, we provide a short tutorial for using the Hadoop TeraSort benchmark. TeraGen generates random data that can be used as input data for a subsequent running…

Computing systems | Storage systems

Large-scale Data Storage and Processing System in Datacenters
By Eric Ma | Dec 11, 2012 | Aug 30, 2020

Research on Cloud Computing has made great progress, and many excellent large-scale systems have been designed in recent years. I compiled a list of some large-scale data storage and processing systems in datacenters as follows. Storage systems Google File System (GFS): http://research.google.com/archive/gfs.html HDFS implementation: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html Colossus (GFS2): Colossus: Successor to the Google File System (GFS)…

Computing systems | Resource management | Storage systems

Microsoft’s Cosmos Service
By Eric Ma | Dec 10, 2012 | May 31, 2020

Cosmos is “Microsoft’s internal data storage/query system for analyzing enormous amounts (as in petabytes) of data”. There is no paper/technical report about Cosmos published yet. I compiled a list of information about Cosmos on the Web as follows. What is Microsoft’s Cosmos service? by Yaron Y. Goland. Microsoft Cosmos: Petabytes perfectly processed perfunctorily by Seth…

Storage systems | Systems

Colossus: Successor to the Google File System (GFS)
By Eric Ma | Nov 29, 2012 | Aug 2, 2020

Colossus is the successor to the Google File System (GFS), as mentioned in the paper on Spanner at OSDI 2012. Colossus is also used by Spanner to store its tablets. Information about Colossus is scarce compared with GFS, which was published in a paper at SOSP 2003. There is still some information about Colossus…

News

Conference Ranking by Average Number of Citations in the Last 5 Years, 2012
By Eric Ma | Oct 24, 2012

I tried to find on the Internet a ranking of the top conferences by the average number of citations in the last 5 years, but failed to find one. However, there are many rankings by overall citations and numbers of publications. Hence, it is not hard to calculate the average number of citations…

Computing systems | Storage systems | Systems

Hadoop Installation Tutorial (Hadoop 1.x)
By Eric Ma | Oct 9, 2012 | Nov 28, 2020

Update: If you are new to Hadoop and trying to install it, please check the newer version: Hadoop Installation Tutorial (Hadoop 2.x). Hadoop mainly consists of two parts: Hadoop MapReduce and HDFS. Hadoop MapReduce is a programming model and software framework for writing applications. It is an open-source variant of MapReduce, which was initially designed…

Tutorial

Reading List for Distributed Systems and Cloud Computing
By Eric Ma | Sep 15, 2012 | Aug 30, 2020

Understanding the literature is usually the first step in doing research, and systems research on cloud computing is no exception. A reading list may help a lot for those who are just starting in cloud computing research. Prof. Lin Gu, my PhD supervisor, compiled a reading list for systems research on cloud computing. The reading…

News

Conferences on Cloud Computing 2013
By Eric Ma | Sep 1, 2012

This post lists important conferences related to Cloud Computing in the year 2013. SOSP 2013 SOSP’13: The 24th ACM Symposium on Operating Systems Principles. November 3-6, 2013, Nemacolin Woodlands Resort, Pennsylvania. The biennial ACM Symposium on Operating Systems Principles is the world’s premier forum for researchers, developers, programmers, and teachers of computer systems technology. Academic and…

Linux

Managing Repositories on Git Server Using Gitosis
By Eric Ma | Mar 25, 2012 | Aug 23, 2020

This post introduces how to manage users and repositories and how to use these repositories. Please refer to Setting Up a Git Server Using Gitosis for how to set up the git server. Please refer to Howto for New Git Users for how to use git as a new user. Create a…

Linux | Tutorial

Setting Up a Git Server Using Gitosis
By Eric Ma | Feb 25, 2012 | Sep 26, 2014

Update: Since gitosis is no longer maintained or supported, please check out gitolite for setting up a new git server. (See the comment from Sitaram Chamarty, the author of gitolite.) Gitosis is a piece of software written by Tommi Virtanen for hosting git repositories. It manages multiple repositories under the same user account…

Tutorial

Hadoop Default Ports
By Eric Ma | Jan 15, 2012 | Mar 27, 2018

Hadoop’s namenode and datanodes expose a number of TCP ports that Hadoop’s daemons use to communicate with each other or to listen directly for users’ requests. This port information is needed by both Hadoop users and cluster administrators to write programs or configure firewalls/gateways accordingly. A post written by Philip Zeyliger from Cloudera’s blog summarizes the…

Tutorial

A Simple Sort Benchmark on Hadoop
By Eric Ma | Jan 7, 2012 | Apr 5, 2016

After installing Hadoop, we usually run some benchmark programs to test whether the system works well. In the Hadoop installation tutorial post, we show a very simple example that greps strings from a simple set of files. In this post, we introduce the Sort program for testing and benchmarking Hadoop. The Sort program is also…

News

Conferences on Cloud Computing 2012
By Eric Ma | May 11, 2011 | Mar 27, 2018

This post lists important conferences on Cloud Computing in the year 2012. OSDI 2012 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’12) October 8–10, 2012, Hollywood, CA “The tenth OSDI seeks to present innovative, exciting research in computer systems. OSDI brings together professionals from academic and industrial backgrounds in what has become a…

Tutorial

Pitfalls and Lessons on Configuring and Tuning Hadoop
By Eric Ma | Apr 26, 2011 | Mar 27, 2018

This post lists pitfalls and lessons learned when configuring and tuning Hadoop. Hadoop with IPv6 Hadoop doesn’t support IPv6 currently (up to 0.20.2 and 0.21.0): Hadoop and IPv6. The performance of the cluster may suffer when IPv6 is turned on in clusters: mail archive. One good practice is to disable IPv6 on servers in the Hadoop…
