Fast, robust, easy to use, scalable, widely applicable, and with monitoring included: that is the promise of MapReduce, a framework from Google for the concurrent processing of very large data sets on computer clusters. A bold promise. This article will show whether MapReduce keeps it.
Data analytics is becoming increasingly prominent in a variety
of application areas ranging from extracting business intelligence
to processing data from scientific studies. The MapReduce
programming paradigm lends itself well to these data-intensive
analytics jobs, given its ability to scale out and leverage several
machines to process data in parallel. In this work we argue
that such MapReduce-based analytics are particularly synergistic
with the pay-as-you-go model of a cloud platform. However,
a key challenge facing end-users in this environment is
the ability to provision MapReduce applications to minimize
the incurred cost, while obtaining the best performance. This
paper first motivates the importance of optimally provisioning a
MapReduce job, and demonstrates that existing approaches can
result in far from optimal provisioning. We then present a preliminary
approach that improves MapReduce provisioning by
analyzing and comparing resource consumption of the application
at hand with a database of similar resource consumption
signatures of other applications.
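
The signature comparison described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' method: the abstract does not say how signatures are represented or compared, so the three-component usage vectors (CPU, disk, network), the cosine similarity measure, the example database, and the names `SIGNATURES`, `cosine_similarity`, and `closest_signature` are all hypothetical.

```python
from math import sqrt

# Hypothetical signature database: application name -> resource usage vector.
# The components (CPU, disk I/O, network, each normalized to 0..1) and all
# values are invented for illustration.
SIGNATURES = {
    "wordcount": [0.9, 0.2, 0.1],
    "sort": [0.3, 0.8, 0.7],
    "grep": [0.6, 0.3, 0.1],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def closest_signature(sample, database):
    """Return the name of the stored signature most similar to `sample`."""
    return max(database, key=lambda name: cosine_similarity(sample, database[name]))


if __name__ == "__main__":
    # Usage vector measured on a short trial run of a new job (made-up values).
    new_job = [0.85, 0.25, 0.1]
    print(closest_signature(new_job, SIGNATURES))
```

Under these assumptions, the provisioning settings already known to work well for the best-matching application would serve as the starting point for the new job.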
O. Görlitz, S. Sizov, and S. Staab. Proceedings of the Seventh International Workshop on Peer-to-Peer Systems, IPTPS08, Tampa Bay, USA, (February 2008)
D. Lin. ICML '98: Proceedings of the Fifteenth International Conference on Machine Learning, page 296--304. San Francisco, CA, USA, Morgan Kaufmann Publishers Inc., (1998)
D. Newman, J. Lau, K. Grieser, and T. Baldwin. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, page 100--108. Los Angeles, California, Association for Computational Linguistics, (June 2010)