We are interested in tracking changes in large-scale data by periodically creating an agglomerative clustering and examining the evolution of clusters (communities) over time. We examine a large real-world data set: the NEC CiteSeer database, a linked network of >250,000 papers. Tracking changes over time requires a clustering algorithm that produces clusters stable under small perturbations of the input data. However, small perturbations of the CiteSeer data lead to significant changes to most of the clusters. One reason for this is that the order in which papers within communities are combined is somewhat arbitrary. However, certain subsets of papers, called natural communities, correspond to real structure in the CiteSeer database and thus appear in any clustering. By identifying the subset of clusters that remain stable under multiple clustering runs, we get the set of natural communities that we can track over time. We demonstrate that such natural communities allow us to identify emerging communities and track temporal changes in the underlying structure of our network data.
DATA: Citeseer
TASK: find trends in communities, find new evolving communities
METHOD: they repeatedly use a "standard centroid-based agglomerative clustering technique based on cosine similarity" at different points in the timeline to find a set of communities that occur in the previous clustering (based on a stability measure). these are called natural communities.
They use the cosine of the "reference vectors" as the similarity measure. They also find "(future) co-citations" interesting, but this will need some timeline gap which is not applicable, since they want to reveal evolving communities as soon as possible.
---
"
Natural Communities. We define natural communities as follows.
We fix an input data perturbation value of 5%. Then we produce
a series of subgraphs G1, G2, . . . , Gn of the original network G
(the CiteSeer citation graph) where each Gi is the subgraph of
G induced by a random subset of 95% of the core vertices of G.
Our clustering algorithm then produces a set of clustering trees
TT1, T2, . . . , Tn. We choose the first tree T1 as our base tree.
We now define a natural community or cluster as follows.
Definition 3.1. A community C in base tree T1 is natural iff in a
fraction f of the clustering trees in T the best-match of C has a
value greater than p, a predefined threshold.
The definition has two parameters: f, the fraction of trees out
of n trees total, and p, a lower-bound for the best match.
Depending on what values one chooses for these parameters, one
obtains more or less well defined natural communities. In
practice, we set these values sufficiently high to select clusters
that are clearly different from the average cluster in the tree.
"
%0 Journal Article
%1 citeulike:370723
%A Hopcroft, John
%A Khan, Omar
%A Kulis, Brian
%A Selman, Bart
%D 2004
%J Proceedings of the National Academy of Sciences
%K socialnets community
%P 5249--5253
%T Tracking evolving communities in large linked networks
%U http://www.pnas.org/cgi/content/full/101/suppl_1/5249
%V 101
%X We are interested in tracking changes in large-scale data by periodically creating an agglomerative clustering and examining the evolution of clusters (communities) over time. We examine a large real-world data set: the NEC CiteSeer database, a linked network of >250,000 papers. Tracking changes over time requires a clustering algorithm that produces clusters stable under small perturbations of the input data. However, small perturbations of the CiteSeer data lead to significant changes to most of the clusters. One reason for this is that the order in which papers within communities are combined is somewhat arbitrary. However, certain subsets of papers, called natural communities, correspond to real structure in the CiteSeer database and thus appear in any clustering. By identifying the subset of clusters that remain stable under multiple clustering runs, we get the set of natural communities that we can track over time. We demonstrate that such natural communities allow us to identify emerging communities and track temporal changes in the underlying structure of our network data.
%& 1
@article{citeulike:370723,
abstract = {We are interested in tracking changes in large-scale data by periodically creating an agglomerative clustering and examining the evolution of clusters (communities) over time. We examine a large real-world data set: the NEC CiteSeer database, a linked network of >250,000 papers. Tracking changes over time requires a clustering algorithm that produces clusters stable under small perturbations of the input data. However, small perturbations of the CiteSeer data lead to significant changes to most of the clusters. One reason for this is that the order in which papers within communities are combined is somewhat arbitrary. However, certain subsets of papers, called natural communities, correspond to real structure in the CiteSeer database and thus appear in any clustering. By identifying the subset of clusters that remain stable under multiple clustering runs, we get the set of natural communities that we can track over time. We demonstrate that such natural communities allow us to identify emerging communities and track temporal changes in the underlying structure of our network data.},
added-at = {2006-06-16T10:34:37.000+0200},
author = {Hopcroft, John and Khan, Omar and Kulis, Brian and Selman, Bart},
biburl = {https://www.bibsonomy.org/bibtex/28dc7fa75df5cef4413a1766d530c300f/ldietz},
chapter = 1,
citeulike-article-id = {370723},
comment = {DATA: Citeseer
TASK: find trends in communities, find new evolving communities
METHOD: they repeatedly use a "standard centroid-based agglomerative clustering technique based on cosine similarity" at different points in the timeline to find a set of communities that occur in the previous clustering (based on a stability measure). these are called natural communities.
They use the cosine of the "reference vectors" as the similarity measure. They also find "(future) co-citations" interesting, but this will need some timeline gap which is not applicable, since they want to reveal evolving communities as soon as possible.
---
"
Natural Communities. We define natural communities as follows.
We fix an input data perturbation value of 5%. Then we produce
a series of subgraphs G1, G2, . . . , Gn of the original network G
(the CiteSeer citation graph) where each Gi is the subgraph of
G induced by a random subset of 95% of the core vertices of G.
Our clustering algorithm then produces a set of clustering trees
T{T1, T2, . . . , Tn}. We choose the first tree T1 as our base tree.
We now define a natural community or cluster as follows.
Definition 3.1. A community C in base tree T1 is natural iff in a
fraction f of the clustering trees in T the best-match of C has a
value greater than p, a predefined threshold.
The definition has two parameters: f, the fraction of trees out
of n trees total, and p, a lower-bound for the best match.
Depending on what values one chooses for these parameters, one
obtains more or less well defined natural communities. In
practice, we set these values sufficiently high to select clusters
that are clearly different from the average cluster in the tree.
"},
interhash = {b9c04b5e203b149dfeaa65703b4d8eaa},
intrahash = {8dc7fa75df5cef4413a1766d530c300f},
journal = {Proceedings of the National Academy of Sciences},
keywords = {socialnets community},
month = {April},
pages = {5249--5253},
priority = {0},
timestamp = {2006-06-16T10:34:37.000+0200},
title = {Tracking evolving communities in large linked networks},
url = {http://www.pnas.org/cgi/content/full/101/suppl_1/5249},
volume = 101,
year = 2004
}