Abstract

With the rapid growth of information technology, many business applications need to mine frequent patterns and find associations among them in large, distributed databases. Because the FP-tree is considered the most compact data structure for holding data patterns in memory, there have been efforts to parallelize and distribute it to handle large databases. However, these approaches incur a lot of communication overhead during mining. In this paper, a parallel and distributed frequent pattern mining algorithm using the Hadoop MapReduce framework is proposed, which shows the best performance results for large databases. The proposed algorithm partitions the database so that each local node works independently and generates frequent patterns locally, sharing only the global frequent pattern header table. These local frequent patterns are merged at the final stage. This reduces the overall communication overhead during structure construction as well as during pattern mining. The item set count is also taken into consideration, reducing processor idle time. The Hadoop MapReduce framework is used in all steps of the algorithm. Experiments carried out on a PC cluster with 5 computing nodes show improved execution time compared to other algorithms. The experimental results show that the proposed algorithm scales efficiently to very large databases.
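
The flow described above (a shared global header table, independent local work per partition, one final merge) can be illustrated with a minimal single-process sketch. It is not the paper's implementation: it uses plain Python rather than Hadoop, and it replaces FP-tree construction with brute-force itemset enumeration, which is only practical for tiny inputs. All function names are hypothetical, and counting local supports without a local threshold is an assumption made here so that the merged counts are exact.

```python
from collections import Counter
from itertools import combinations


def map_item_counts(partition):
    """Map step (assumed): count item occurrences inside one local partition."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))  # count each item once per transaction
    return counts


def reduce_counts(partial_counts):
    """Reduce step (assumed): sum per-partition counters into global counts."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)  # Counter.update adds counts
    return total


def build_header_table(partitions, min_support):
    """Global header table: frequent items in descending order of support."""
    global_counts = reduce_counts(map_item_counts(p) for p in partitions)
    frequent = {i: c for i, c in global_counts.items() if c >= min_support}
    # Ties are broken alphabetically so the order is deterministic.
    return sorted(frequent, key=lambda i: (-frequent[i], i))


def local_pattern_counts(partition, header):
    """Local step: count itemsets using only the shared header table.

    Brute-force enumeration stands in for FP-tree mining in this sketch.
    """
    rank = {item: pos for pos, item in enumerate(header)}
    counts = Counter()
    for transaction in partition:
        items = sorted((i for i in set(transaction) if i in rank), key=rank.get)
        for size in range(1, len(items) + 1):
            for pattern in combinations(items, size):
                counts[pattern] += 1
    return counts


def mine(partitions, min_support):
    """Merge local counts once at the end and apply the global threshold."""
    header = build_header_table(partitions, min_support)
    merged = reduce_counts(local_pattern_counts(p, header) for p in partitions)
    return {p: c for p, c in merged.items() if c >= min_support}
```

In this sketch the only data shared between partitions is the header table and the final per-pattern counts, which mirrors the abstract's point that nodes do not communicate while building structures or mining.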
