Spark

QA

How to skip mapper function in Hadoop
By Eric Ma | Mar 24, 2018 | Mar 28, 2018

In Hadoop, I need to skip the mapper function and directly execute the reducer function. We are doing this to improve Hadoop performance: if the Hadoop framework is used to analyze the same data sets, then the mapper’s output will be the same for different kinds of jobs. To avoid redundant computation of the same results, I am planning to…

QA

What are the DDL and DML of Shark (Spark SQL)?
By Eric Ma | Mar 24, 2018 | Mar 24, 2018

Currently, I want to take Shark’s (Spark SQL) DDL and DML as a reference for designing and implementing SQLE’s DDL and DML. However, I cannot find its DDL and DML. I can only find several SQL statements in the Shark paper [1]. [1] Shark paper: http://tab.d-thinker.org/showthread.php?tid=2585 Shark’s language is Hive QL. HQL’s DDL and DML can be found at Hive…

Computing systems | Storage systems | Systems

Big Data Benchmark from AMPLab of UC Berkeley
By Eric Ma | Mar 17, 2014 | Sep 5, 2020

Benchmarks are important for understanding performance and for comparing different systems quantitatively and qualitatively. Many analytic frameworks, such as Hive, Impala and Shark, have been designed and implemented in recent years and have become fundamental software for processing big data. How to benchmark these big data analytic systems is an interesting problem. The Big Data Benchmark The…

Computing systems | Storage systems

Large-scale Data Storage and Processing System in Datacenters
By Eric Ma | Dec 11, 2012 | Aug 30, 2020

Research on cloud computing has made great progress, and many excellent large-scale systems have been designed in recent years. I compiled a list of some large-scale data storage and processing systems in datacenters as follows. Storage systems: Google File System (GFS): http://research.google.com/archive/gfs.html HDFS implementation: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html Colossus (GFS2): Colossus: Successor to the Google File System (GFS)…

Tutorial

Reading List for Distributed Systems and Cloud Computing
By Eric Ma | Sep 15, 2012 | Aug 30, 2020

Understanding the literature is usually the first step in doing research, and systems research on cloud computing is no exception. A reading list can help a lot for those who are just starting out in cloud computing research. Prof. Lin Gu, my PhD supervisor, compiled a reading list for systems research on cloud computing. The reading…
