Abstract
The growing use of the Internet and advances in cloud computing are creating large new datasets of increasing value to business. The data that cloud applications need to process is growing much faster than the available computing power. Hadoop MapReduce has become a powerful computation model for addressing this problem. Many cloud services now require users to share confidential data, such as electronic health records, for research analysis or data mining, which raises privacy concerns. k-anonymity is one of the most widely used privacy models. The scale of data in cloud applications is increasing sharply in line with the Big Data trend, making it a challenge for conventional software tools to process such large-scale data within a tolerable elapsed time. As a consequence, current anonymization techniques struggle to preserve privacy on confidential large-scale data sets because they lack scalability. In this project, we propose a scalable two-phase approach to anonymizing large-scale data sets using a dynamic MapReduce framework, the Top-Down Specialization (TDS) algorithm, and the k-anonymity privacy model. Resource usage is optimized in three ways. First, the under-utilization of map and reduce tasks is reduced using Dynamic Hadoop Slot Allocation (DHSA). Second, the performance tradeoff between a single job and a batch of jobs is balanced using Speculative Execution Performance Balancing (SEPB). Third, data locality is improved without any impact on fairness using Slot Pre-Scheduling. Experimental evaluation results demonstrate that, with this project, the scalability, efficiency, and privacy of data sets are significantly improved over existing approaches.
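To make the k-anonymity requirement concrete, the following is a minimal sketch of how the property can be checked on an already generalized table. It is not the TDS algorithm or the MapReduce pipeline proposed here; the function name `is_k_anonymous`, the field names, and the sample records are hypothetical and chosen only for illustration.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    # Count how many records share each combination of quasi-identifier values.
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    # The table is k-anonymous when no group is smaller than k.
    return all(count >= k for count in groups.values())


# Hypothetical, already generalized health records (illustrative values only).
records = [
    {"age": "30-39", "zip": "130**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "130**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "148**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "148**", "diagnosis": "diabetes"},
]

two_anonymous = is_k_anonymous(records, ["age", "zip"], k=2)
three_anonymous = is_k_anonymous(records, ["age", "zip"], k=3)
```

In this sample each (age, zip) combination appears twice, so the table satisfies 2-anonymity but not 3-anonymity. An anonymization algorithm such as TDS searches for a level of generalization at which this check holds while retaining as much detail as possible.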