Compressing and uncompressing files are frequent operations. The common tools for this on Linux are gzip, bzip2, 7z, rar and zip. This post introduces how to compress and uncompress files on Linux using these tools. We use the best compression rate with all of these tools and mark the options for “best rate” in bold. We can delete
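As a small illustration of the "best rate" idea, here is a minimal sketch that uses Python's standard `gzip` and `bz2` modules instead of the command-line tools the post covers. The helper name `compress_best` is hypothetical; level 9 is the highest compression level both modules accept.

```python
import bz2
import gzip


def compress_best(data):
    """Compress bytes with gzip and bzip2 at their highest level (9).

    compress_best is a hypothetical helper used only for illustration.
    """
    return {
        "gzip": gzip.compress(data, compresslevel=9),
        "bzip2": bz2.compress(data, compresslevel=9),
    }


if __name__ == "__main__":
    sample = b"the same line repeated many times\n" * 1000
    for name, blob in compress_best(sample).items():
        # Report the size before and after compression for each format.
        print(name, len(sample), "->", len(blob), "bytes")
```

Decompression is the mirror image: `gzip.decompress` and `bz2.decompress` return the original bytes.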
Storage Architecture and Challenges by Andrew Fikes at Google Faculty Summit 2010
Storage Architecture and Challenges, presented at the Faculty Summit on July 29, 2010, by Andrew Fikes, Principal Engineer. Download PDF (from archive.org). These slides introduce some of Google’s storage systems, with insights and discussion of problems.
Designs, Lessons and Advice from Building Large Distributed Systems
Designs, Lessons and Advice from Building Large Distributed Systems by Jeff Dean. Everyone who is interested in large distributed systems should read it: PDF for Designs, Lessons and Advice from Building Large Distributed Systems by Jeff Dean.
PUMA: A MapReduce Benchmark Suite
MapReduce is a well-known programming model designed for generating and processing large data sets. There are various MapReduce implementations; a widely known and used one is Hadoop. Benchmarking MapReduce frameworks has become important. Faraz Ahmad et al. developed a benchmark suite: PUMA MapReduce Benchmark. During our work on MapReduce, we developed a benchmark suite
Hadoop TeraSort Benchmark
TeraSort is one of Hadoop’s widely used benchmarks. Hadoop’s distribution contains both the input generator and the sorting implementation: TeraGen generates the input and TeraSort conducts the sorting. Here, we provide a short tutorial for using the Hadoop TeraSort benchmark. TeraGen generates random data that can be used as input data for a subsequent running
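TeraGen is told how many rows to generate rather than how many bytes, and each row is 100 bytes, so sizing an input takes a little arithmetic. The sketch below shows that conversion; the function name `teragen_rows` is hypothetical and not part of Hadoop.

```python
ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_rows(target_bytes):
    """Return the number of whole rows that fit in target_bytes.

    teragen_rows is a hypothetical helper for sizing a TeraGen input.
    """
    if target_bytes < 0:
        raise ValueError("target_bytes must be non-negative")
    return target_bytes // ROW_BYTES


if __name__ == "__main__":
    # Row count for a 1 GB (10**9 bytes) input.
    print(teragen_rows(10**9))
```

The resulting row count is the value to pass as TeraGen's row-count argument.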
Large-scale Data Storage and Processing System in Datacenters
Research on cloud computing has made great progress, and many excellent large-scale systems have been designed in recent years. I compiled a list of some large-scale data storage and processing systems in datacenters as follows. Storage systems Google File System (GFS): http://research.google.com/archive/gfs.html HDFS implementation: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html Colossus (GFS2): Colossus: Successor to the Google File System (GFS)
Microsoft’s Cosmos Service
Cosmos is “Microsoft’s internal data storage/query system for analyzing enormous amounts (as in petabytes) of data”. No paper or technical report about Cosmos has been published yet. I compiled a list of information about Cosmos on the Web as follows. What is Microsoft’s Cosmos service? by Yaron Y. Goland. Microsoft Cosmos: Petabytes perfectly processed perfunctorily by Seth
Colossus: Successor to the Google File System (GFS)
Colossus is the successor to the Google File System (GFS), as mentioned in the paper on Spanner at OSDI 2012. Colossus is also used by Spanner to store its tablets. Information about Colossus is scarce compared with GFS, which was published in a paper at SOSP 2003. There is still some information about Colossus
Conference Ranking by Average Number of Citations in the Last 5 Years, 2012
I tried to find on the Internet a ranking of the top conferences by average number of citations in the last 5 years, but failed to find one. However, there are many rankings by overall citations and number of publications. Hence, it is not hard to calculate the average number of citations
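The calculation described above is a simple division of total citations by number of publications, followed by a sort. A minimal sketch, assuming the input is a mapping from conference name to a (total citations, number of papers) pair; the function names and the sample figures are hypothetical and not taken from the post.

```python
def average_citations(total_citations, num_papers):
    """Average citations per paper; num_papers must be positive."""
    if num_papers <= 0:
        raise ValueError("num_papers must be positive")
    return total_citations / num_papers


def rank_by_average(stats):
    """Sort conferences by average citations per paper, highest first.

    stats maps a conference name to (total_citations, num_papers).
    """
    averages = [
        (name, average_citations(cites, papers))
        for name, (cites, papers) in stats.items()
    ]
    return sorted(averages, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up numbers for two placeholder conferences.
    sample = {"ConfA": (100, 10), "ConfB": (90, 3)}
    for name, avg in rank_by_average(sample):
        print(name, avg)
```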
Hadoop Installation Tutorial (Hadoop 1.x)
Update: If you are new to Hadoop and trying to install it, please check the newer version: Hadoop Installation Tutorial (Hadoop 2.x). Hadoop mainly consists of two parts: Hadoop MapReduce and HDFS. Hadoop MapReduce is a programming model and software framework for writing applications, which is an open-source variant of MapReduce that was initially designed
Reading List for Distributed Systems and Cloud Computing
Understanding the literature is usually the first step in doing research, and the same holds for systems research on cloud computing. A reading list may help a lot for those who are just starting in cloud computing research. Prof. Lin Gu, my PhD supervisor, compiled a reading list for systems research on cloud computing. The reading
Conferences on Cloud Computing 2013
This post lists important conferences related to cloud computing in 2013. SOSP 2013 SOSP’13: The 24th ACM Symposium on Operating Systems Principles. November 3-6, 2013, Nemacolin Woodlands Resort, Pennsylvania. The biennial ACM Symposium on Operating Systems Principles is the world’s premier forum for researchers, developers, programmers, and teachers of computer systems technology. Academic and
Managing Repositories on Git Server Using Gitosis
This post introduces how to manage users and repositories and how to use these repositories. Please refer to Setting Up a Git Server Using Gitosis for how to set up the git server. Please refer to Howto for New Git Users for how to use git as a new user. Create a
Setting Up a Git Server Using Gitosis
Update: Since gitosis is no longer maintained or supported, please check out gitolite for setting up a new git server. (See the comment from Sitaram Chamarty, the author of gitolite.) Gitosis is a piece of software written by Tommi Virtanen for hosting git repositories. It manages multiple repositories under the same user account.
Hadoop Default Ports
Hadoop’s namenode and datanodes expose a number of TCP ports used by Hadoop’s daemons to communicate with each other or to listen directly to users’ requests. This port information is needed by both Hadoop users and cluster administrators to write programs or configure firewalls/gateways accordingly. A post written by Philip Zeyliger on Cloudera’s blog summarizes the
A Simple Sort Benchmark on Hadoop
After installing Hadoop, we usually run some benchmark programs to test whether the system works well. In the Hadoop installation tutorial, we show a very simple example that greps strings from a simple set of files. In this post, we introduce the Sort program for testing and benchmarking Hadoop. The Sort program is also
Conferences on Cloud Computing 2012
This post lists important conferences on cloud computing in 2012. OSDI 2012 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’12) October 8–10, 2012, Hollywood, CA “The tenth OSDI seeks to present innovative, exciting research in computer systems. OSDI brings together professionals from academic and industrial backgrounds in what has become a
Pitfalls and Lessons on Configuring and Tuning Hadoop
This post lists pitfalls and lessons learned when configuring and tuning Hadoop. Hadoop with IPv6 Hadoop doesn’t support IPv6 currently (up to 0.20.2 and 0.21.0): Hadoop and IPv6. The performance of the cluster may suffer if IPv6 is turned on: mail archive. One good practice is to disable IPv6 on servers in the Hadoop
Setting Up Standalone (Local) Hadoop
Hadoop is designed to run on hundreds to thousands of computers in a cluster. However, Hadoop is configured by default to run in a non-distributed mode as a single Java process. This is especially useful for debugging, since distributed debugging is really a nightmare. This post introduces how to set up a standalone Hadoop environment.
Conferences on Cloud Computing 2011
This post lists important conferences on cloud computing in 2011. ACM Symposium on Cloud Computing October 27 and 28, 2011, Cascais, Portugal Submission Deadline: April 30, 2011 23rd ACM Symposium on Operating Systems Principles (SOSP) October 23-26, 2011, Cascais, Portugal Submission deadline: March 18, 2011, 11:59 PM GMT EuroSys 2011 April 10-13, 2011. Salzburg,