copy delete add this publication to your clipboard
community post
history of this post
URL
DOI
BibTeX
EndNote
APA
Chicago
DIN 1505
Harvard
MSOffice XML

Tracking evolving communities in large linked networks

J. Hopcroft, O. Khan, B. Kulis, and B. Selman. Proceedings of the National Academy of Sciences, (April 2004)

Abstract

We are interested in tracking changes in large-scale data by periodically creating an agglomerative clustering and examining the evolution of clusters (communities) over time. We examine a large real-world data set: the NEC CiteSeer database, a linked network of >250,000 papers. Tracking changes over time requires a clustering algorithm that produces clusters stable under small perturbations of the input data. However, small perturbations of the CiteSeer data lead to significant changes to most of the clusters. One reason for this is that the order in which papers within communities are combined is somewhat arbitrary. However, certain subsets of papers, called natural communities, correspond to real structure in the CiteSeer database and thus appear in any clustering. By identifying the subset of clusters that remain stable under multiple clustering runs, we get the set of natural communities that we can track over time. We demonstrate that such natural communities allow us to identify emerging communities and track temporal changes in the underlying structure of our network data.

Links and resources

BibTeX key: citeulike:370723
entry type: article
year: 2004
month: April
chapter: 1
journal: Proceedings of the National Academy of Sciences
pages: 5249--5253
volume: 101
comment: DATA: Citeseer TASK: find trends in communities, find new evolving communities METHOD: they repeatedly use a "standard centroid-based agglomerative clustering technique based on cosine similarity" at different points in the timeline to find a set of communities that occur in the previous clustering (based on a stability measure). these are called natural communities. They use the cosine of the "reference vectors" as the similarity measure. They also find "(future) co-citations" interesting, but this will need some timeline gap which is not applicable, since they want to reveal evolving communities as soon as possible. --- " Natural Communities. We define natural communities as follows. We fix an input data perturbation value of 5%. Then we produce a series of subgraphs G1, G2, . . . , Gn of the original network G (the CiteSeer citation graph) where each Gi is the subgraph of G induced by a random subset of 95% of the core vertices of G. Our clustering algorithm then produces a set of clustering trees TT1, T2, . . . , Tn. We choose the first tree T1 as our base tree. We now define a natural community or cluster as follows. Definition 3.1. A community C in base tree T1 is natural iff in a fraction f of the clustering trees in T the best-match of C has a value greater than p, a predefined threshold. The definition has two parameters: f, the fraction of trees out of n trees total, and p, a lower-bound for the best match. Depending on what values one chooses for these parameters, one obtains more or less well defined natural communities. In practice, we set these values sufficiently high to select clusters that are clearly different from the average cluster in the tree. "
citeulike-article-id: 370723
priority: 0
url: http://www.pnas.org/cgi/content/full/101/suppl_1/5249

@ldietz's tags highlighted

Cite this publication

@article{citeulike:370723, abstract = {We are interested in tracking changes in large-scale data by periodically creating an agglomerative clustering and examining the evolution of clusters (communities) over time. We examine a large real-world data set: the NEC CiteSeer database, a linked network of >250,000 papers. Tracking changes over time requires a clustering algorithm that produces clusters stable under small perturbations of the input data. However, small perturbations of the CiteSeer data lead to significant changes to most of the clusters. One reason for this is that the order in which papers within communities are combined is somewhat arbitrary. However, certain subsets of papers, called natural communities, correspond to real structure in the CiteSeer database and thus appear in any clustering. By identifying the subset of clusters that remain stable under multiple clustering runs, we get the set of natural communities that we can track over time. We demonstrate that such natural communities allow us to identify emerging communities and track temporal changes in the underlying structure of our network data.}, added-at = {2006-06-16T10:34:37.000+0200}, author = {Hopcroft, John and Khan, Omar and Kulis, Brian and Selman, Bart}, biburl = {https://www.bibsonomy.org/bibtex/28dc7fa75df5cef4413a1766d530c300f/ldietz}, chapter = 1, citeulike-article-id = {370723}, comment = {DATA: Citeseer TASK: find trends in communities, find new evolving communities METHOD: they repeatedly use a "standard centroid-based agglomerative clustering technique based on cosine similarity" at different points in the timeline to find a set of communities that occur in the previous clustering (based on a stability measure). these are called natural communities. They use the cosine of the "reference vectors" as the similarity measure. They also find "(future) co-citations" interesting, but this will need some timeline gap which is not applicable, since they want to reveal evolving communities as soon as possible. --- " Natural Communities. We define natural communities as follows. We fix an input data perturbation value of 5%. Then we produce a series of subgraphs G1, G2, . . . , Gn of the original network G (the CiteSeer citation graph) where each Gi is the subgraph of G induced by a random subset of 95% of the core vertices of G. Our clustering algorithm then produces a set of clustering trees T{T1, T2, . . . , Tn}. We choose the first tree T1 as our base tree. We now define a natural community or cluster as follows. Definition 3.1. A community C in base tree T1 is natural iff in a fraction f of the clustering trees in T the best-match of C has a value greater than p, a predefined threshold. The definition has two parameters: f, the fraction of trees out of n trees total, and p, a lower-bound for the best match. Depending on what values one chooses for these parameters, one obtains more or less well defined natural communities. In practice, we set these values sufficiently high to select clusters that are clearly different from the average cluster in the tree. "}, interhash = {b9c04b5e203b149dfeaa65703b4d8eaa}, intrahash = {8dc7fa75df5cef4413a1766d530c300f}, journal = {Proceedings of the National Academy of Sciences}, keywords = {socialnets community}, month = {April}, pages = {5249--5253}, priority = {0}, timestamp = {2006-06-16T10:34:37.000+0200}, title = {Tracking evolving communities in large linked networks}, url = {http://www.pnas.org/cgi/content/full/101/suppl_1/5249}, volume = 101, year = 2004 }

BibSonomy

copy delete add this publication to your clipboard
community post
history of this post
URL
DOI
BibTeX
EndNote
APA
Chicago
DIN 1505
Harvard
MSOffice XML

Tracking evolving communities in large linked networks

Abstract

Links and resources

Tags

community

Cite this publication

More citation styles

search on

Meta data

Comments and Reviews
(0)

BibSonomy

copydeleteadd this publication to your clipboardcommunity posthistory of this postURLDOIBibTeXEndNoteAPAChicagoDIN 1505HarvardMSOffice XML Tracking evolving communities in large linked networks

Abstract

Links and resources

Tags

community

Cite this publication

More citation styles

search on

Meta data

Comments and Reviews (0)

copy delete add this publication to your clipboard
community post
history of this post
URL
DOI
BibTeX
EndNote
APA
Chicago
DIN 1505
Harvard
MSOffice XML

Tracking evolving communities in large linked networks

Comments and Reviews
(0)