Article

Parallel clustering algorithm for large-scale biological data sets.

M. Wang, W. Zhang, W. Ding, D. Dai, H. Zhang, H. Xie, L. Chen, Y. Guo, and J. Xie.
PloS one (2014)

Abstract

The recent explosion of biological data poses a great challenge for traditional clustering algorithms. As data sets grow, cluster identification requires much more memory and longer runtimes. The affinity propagation algorithm outperforms many other classical clustering algorithms and is widely applied in biological research. However, its time and space complexity become a major bottleneck when handling large-scale data sets. Moreover, because the algorithm clusters data based on the similarities between data pairs, a similarity matrix, which itself takes a long time to construct, is required before the affinity propagation algorithm can run. This paper proposes two types of parallel architectures to accelerate similarity matrix construction and the affinity propagation algorithm. A shared-memory architecture is used to construct the similarity matrix, and a distributed system is used for the affinity propagation algorithm because of its large memory size and great computing capacity. The method includes a data partition and reduction scheme designed to minimize the global communication cost among processes. A speedup of 100 is gained with 128 cores. The runtime is reduced from several hours to a few seconds, which indicates that the parallel algorithm is capable of handling large-scale data sets effectively. The parallel affinity propagation also performs well when clustering large-scale gene data (microarray) and detecting families in large protein superfamilies.
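For readers unfamiliar with the two stages the abstract names, the following is a minimal serial sketch in Python with NumPy. It is not the authors' parallel implementation. The negative squared Euclidean similarity, the median preference, the damping factor, and the row-block split in `similarity_matrix` are all assumptions chosen for illustration; the row blocks only hint at how the similarity work could be divided among workers, and the function names are hypothetical.

```python
import numpy as np


def similarity_block(X, start, stop):
    """Similarities of rows start..stop-1 against all rows.

    Assumption: similarity is the negative squared Euclidean distance.
    """
    diff = X[start:stop, None, :] - X[None, :, :]
    return -np.sum(diff ** 2, axis=2)


def similarity_matrix(X, n_blocks=2):
    """Assemble the matrix from row blocks (each block could go to one worker)."""
    bounds = np.linspace(0, len(X), n_blocks + 1, dtype=int)
    blocks = [similarity_block(X, a, b) for a, b in zip(bounds[:-1], bounds[1:])]
    return np.vstack(blocks)


def affinity_propagation(S, damping=0.5, max_iter=200, preference=None):
    """Plain serial affinity propagation with damped message updates."""
    S = S.astype(float).copy()
    n = S.shape[0]
    if preference is None:
        preference = np.median(S)  # assumption: median similarity as preference
    np.fill_diagonal(S, preference)
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities a(i, k)
    rows = np.arange(n)

    for _ in range(max_iter):
        # r(i,k) = s(i,k) - max over k' != k of (a(i,k') + s(i,k'))
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[rows, idx]
        AS[rows, idx] = -np.inf
        second = np.max(AS, axis=1)
        R_new = S - first[:, None]
        R_new[rows, idx] = S[rows, idx] - second
        R = damping * R + (1 - damping) * R_new

        # a(i,k) = min(0, r(k,k) + sum over i' not in {i,k} of max(0, r(i',k)))
        # a(k,k) = sum over i' != k of max(0, r(i',k))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new.diagonal().copy()
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1 - damping) * A_new

    # A point is an exemplar when its self-availability plus
    # self-responsibility is positive.
    exemplars = np.flatnonzero((A + R).diagonal() > 0)
    if exemplars.size == 0:
        return exemplars, np.full(n, -1)
    labels = np.argmax(S[:, exemplars], axis=1)
    labels[exemplars] = np.arange(exemplars.size)
    return exemplars, labels


if __name__ == "__main__":
    # Two well-separated toy groups of three points each.
    X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
    exemplars, labels = affinity_propagation(similarity_matrix(X))
    print("exemplars:", exemplars)
    print("labels:", labels)
```

Both message matrices are n-by-n, which illustrates the memory bottleneck the abstract mentions: storage grows quadratically with the number of data points, and each iteration touches every entry.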

BibTeX key: Wang2014Parallel
entry type: article
year: 2014
journal: PloS one
number: 4
volume: 9
citeulike-article-id: 13133143
citeulike-linkout-1: http://www.hubmed.org/display.cgi?uids=24705246
pmid: 24705246
priority: 2
posted-at: 2014-04-11 05:57:46
issn: 1932-6203
citeulike-linkout-0: http://view.ncbi.nlm.nih.gov/pubmed/24705246
url: http://view.ncbi.nlm.nih.gov/pubmed/24705246

Cite this publication

@article{Wang2014Parallel,
  abstract = {The recent explosion of biological data poses a great challenge for traditional clustering algorithms. As data sets grow, cluster identification requires much more memory and longer runtimes. The affinity propagation algorithm outperforms many other classical clustering algorithms and is widely applied in biological research. However, its time and space complexity become a major bottleneck when handling large-scale data sets. Moreover, because the algorithm clusters data based on the similarities between data pairs, a similarity matrix, which itself takes a long time to construct, is required before the affinity propagation algorithm can run. This paper proposes two types of parallel architectures to accelerate similarity matrix construction and the affinity propagation algorithm. A shared-memory architecture is used to construct the similarity matrix, and a distributed system is used for the affinity propagation algorithm because of its large memory size and great computing capacity. The method includes a data partition and reduction scheme designed to minimize the global communication cost among processes. A speedup of 100 is gained with 128 cores. The runtime is reduced from several hours to a few seconds, which indicates that the parallel algorithm is capable of handling large-scale data sets effectively. The parallel affinity propagation also performs well when clustering large-scale gene data (microarray) and detecting families in large protein superfamilies.},
  added-at = {2018-12-02T16:09:07.000+0100},
  author = {Wang, Minchao and Zhang, Wu and Ding, Wang and Dai, Dongbo and Zhang, Huiran and Xie, Hao and Chen, Luonan and Guo, Yike and Xie, Jiang},
  biburl = {https://www.bibsonomy.org/bibtex/29ff3d1b65b9f0c324928fe3cd49b4301/karthikraman},
  citeulike-article-id = {13133143},
  citeulike-linkout-0 = {http://view.ncbi.nlm.nih.gov/pubmed/24705246},
  citeulike-linkout-1 = {http://www.hubmed.org/display.cgi?uids=24705246},
  interhash = {3c257196ae9a03fa1eb8b2b24c00af44},
  intrahash = {9ff3d1b65b9f0c324928fe3cd49b4301},
  issn = {1932-6203},
  journal = {PloS one},
  keywords = {clustering data-analysis parallel-algorithms},
  number = 4,
  pmid = {24705246},
  posted-at = {2014-04-11 05:57:46},
  priority = {2},
  timestamp = {2018-12-02T16:09:07.000+0100},
  title = {Parallel clustering algorithm for large-scale biological data sets.},
  url = {http://view.ncbi.nlm.nih.gov/pubmed/24705246},
  volume = 9,
  year = 2014
}
