Parallel I/O remains an area of active development, and recent years have produced many new options. Even with these new choices, one requirement is constant: parallel applications need a fast I/O subsystem.
We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing, including:
* Hadoop Core, our flagship sub-project, provides a distributed filesystem (HDFS) and support for the MapReduce distributed computing framework (see the HDFS sketch after this list).
* HBase builds on Hadoop Core to provide a scalable, distributed database.
* Pig is a high-level data-flow language and execution framework for parallel computation. It is built on top of Hadoop Core.
* ZooKeeper is a highly available and reliable coordination system. Distributed applications use ZooKeeper to store and mediate updates for critical shared state (see the second sketch after this list).
* Hive is a data warehouse infrastructure built on Hadoop Core that provides data summarization, ad hoc querying, and analysis of datasets.
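As a concrete illustration of the HDFS side of Hadoop Core, here is a minimal sketch that writes a file and reads it back through the `org.apache.hadoop.fs.FileSystem` API. The path `/tmp/hdfs-example.txt` and the cluster configuration are placeholder assumptions, not anything prescribed by the project description above; with an empty configuration the same code would simply run against the local filesystem.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml if present on the
        // classpath; otherwise this falls back to the local filesystem.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/hdfs-example.txt"); // placeholder path

        // Write a small file (second argument: overwrite if it exists).
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello, hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back line by line.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }

        fs.delete(file, false); // clean up; 'false' means non-recursive
    }
}
```

The `FileSystem` abstraction is the reason the same code works against HDFS, the local disk, or other backends: the URI scheme in the configuration selects the implementation, not the application code.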
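To make "store and mediate updates for critical shared state" concrete, the following sketch uses the standard ZooKeeper Java client to keep a small configuration value in a znode and update it with optimistic concurrency. The ensemble address `localhost:2181`, the znode name `/shared-config`, and the `replicas=...` payload are illustrative assumptions only.

```java
import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class SharedConfig {
    public static void main(String[] args) throws Exception {
        // Placeholder ensemble address and session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        String path = "/shared-config"; // hypothetical znode for shared state
        byte[] initial = "replicas=3".getBytes(StandardCharsets.UTF_8);

        // Create the znode on first use (persistent, world-readable).
        if (zk.exists(path, false) == null) {
            zk.create(path, initial, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                      CreateMode.PERSISTENT);
        }

        // Read the current value together with its version number.
        Stat stat = new Stat();
        byte[] current = zk.getData(path, false, stat);
        System.out.println("config: "
                + new String(current, StandardCharsets.UTF_8));

        // Conditional update: succeeds only if nobody changed the znode
        // since we read it. The version check is how ZooKeeper mediates
        // concurrent updates to shared state.
        try {
            zk.setData(path, "replicas=5".getBytes(StandardCharsets.UTF_8),
                       stat.getVersion());
        } catch (KeeperException.BadVersionException e) {
            System.out.println("lost the race; re-read and retry");
        }

        zk.close();
    }
}
```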
AFS is a distributed filesystem pioneered at Carnegie Mellon University and later supported and developed as a product by Transarc Corporation (now IBM Pittsburgh Labs). It offers a client-server architecture for federated file sharing and replicated read-only content distribution.
S. Ghemawat, H. Gobioff, and S. Leung. The Google File System. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP '03), pages 29–43, New York, NY, USA. ACM, 2003.