Apache's Hadoop project aims to solve these problems by providing a framework for running large data processing applications on clusters of commodity hardware. By combining Hadoop with Amazon EC2 for computation and Amazon S3 for storage, we can run large jobs very economically. This paper describes how to use Amazon Web Services and Hadoop to run an ad hoc analysis on a large collection of web access logs, an analysis that would otherwise have been prohibitively expensive in time or money.
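To make the workflow concrete, the following is a minimal sketch, not the paper's actual job, of a Hadoop MapReduce program that counts requests per URL in Common Log Format access logs, reading its input from and writing its output to S3. The bucket paths and class names are illustrative assumptions, and the s3n:// filesystem scheme shown here varies by Hadoop version (later releases use s3a://).

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogHitCount {

    // Emits (requested path, 1) for each log line.
    public static class HitMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text page = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // Common Log Format: host ident user [date] "METHOD /path HTTP/1.x" status bytes
            String[] fields = value.toString().split(" ");
            if (fields.length > 6) {
                page.set(fields[6]); // the requested path
                ctx.write(page, ONE);
            }
        }
    }

    // Sums the per-path counts; also usable as a combiner.
    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable c : counts) {
                total += c.get();
            }
            ctx.write(key, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "log hit count");
        job.setJarByClass(LogHitCount.class);
        job.setMapperClass(HitMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // Hypothetical S3 bucket paths; input is read from and output written to S3.
        FileInputFormat.addInputPath(job, new Path("s3n://example-logs/access/"));
        FileOutputFormat.setOutputPath(job, new Path("s3n://example-logs/output/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged as a jar, such a job would be submitted from an EC2 node running Hadoop with something like `hadoop jar loghitcount.jar LogHitCount`, so the cluster exists only for the duration of the analysis and the logs and results persist cheaply in S3.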
J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Symposium on Operating Systems Design & Implementation (OSDI '04), pages 137-149. USENIX Association, Berkeley, CA, USA, 2004.