Large-scale Data Storage and Processing Systems in Datacenters
Research on cloud computing has made great progress, and many excellent large-scale systems have been designed in recent years. I have compiled a list of some large-scale data storage and processing systems used in datacenters below.
Storage systems
- Google File System (GFS): http://research.google.com/archive/gfs.html
  - HDFS implementation: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
- Colossus (GFS2): Colossus: Successor to the Google File System (GFS)
- BigTable: http://research.google.com/archive/bigtable.html
- Megastore: http://research.google.com/pubs/pub36971.html
- Spanner: http://research.google.com/archive/spanner.html
- Dynamo: http://dl.acm.org/citation.cfm?id=1294281
- RAMCloud: http://dl.acm.org/citation.cfm?id=1965751 and http://dl.acm.org/citation.cfm?id=2043560
Compute systems
- MapReduce: http://research.google.com/archive/mapreduce.html
  - Hadoop implementation: Hadoop MapReduce Tutorials
- Sawzall: http://research.google.com/archive/sawzall.html
- FlumeJava: http://dl.acm.org/citation.cfm?id=1806638
- Pig Latin: http://dl.acm.org/citation.cfm?id=1376726
- Dryad/DryadLINQ: http://research.microsoft.com/en-us/projects/dryad/
- Pregel: http://dl.acm.org/citation.cfm?id=1807184 and http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html
- Dremel: http://research.google.com/pubs/pub36632.html
- Storm: https://blog.twitter.com/2011/a-storm-is-coming-more-details-and-plans-for-release and https://github.com/nathanmarz/storm/wiki
- Spark: https://www.usenix.org/conference/nsdi12/resilient-distributed-datasets-fault-tolerant-abstraction-memory-cluster-computing and http://spark-project.org/
- DVM: IEEE Transactions on Computers paper and VEE paper
Memcache and TAO from Facebook are also very interesting, scalable systems running in production: http://www.systutorials.com/qa/364/cache-at-facebook