Abstract
Synthetic graph generators facilitate research in graph algorithms and
processing systems by providing access to data, for instance, graphs resembling
social networks, while circumventing privacy and security concerns.
Their practical value, however, depends on their ability to capture important
metrics of real graphs, such as degree distribution and clustering properties.
Graph generators must also be able to produce such graphs at the scale of
real-world industry graphs, that is, hundreds of billions or trillions of
edges.
In this paper, we propose Darwini, a graph generator that captures a number
of core characteristics of real graphs. Importantly, given a source graph, it
can reproduce the degree distribution and, unlike existing approaches, the
local clustering coefficient and joint-degree distributions. Furthermore,
Darwini maintains metrics such as node PageRank, eigenvalues and the K-core
decomposition of a source graph. Comparing Darwini with state-of-the-art
generative models, we show that it can reproduce these characteristics more
accurately. Finally, we provide an open source implementation of our approach
on the vertex-centric Apache Giraph model that allows us to create synthetic
graphs with one trillion edges.
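
To make the three target metrics named above concrete, the following is a minimal sketch of how they can be measured on a small undirected graph. It is not Darwini's implementation and does not use Apache Giraph. The function names (`build_adjacency`, `degree_distribution`, `local_clustering`, `joint_degree_distribution`) and the example edge list are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map, ignoring self-loops and duplicates."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_distribution(adj):
    """Map each degree to the number of nodes that have it."""
    return Counter(len(nbrs) for nbrs in adj.values())


def local_clustering(adj):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    coeffs = {}
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coeffs[node] = 0.0  # convention: undefined cases are reported as 0
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        coeffs[node] = 2.0 * links / (k * (k - 1))
    return coeffs


def joint_degree_distribution(adj):
    """Count edges by the (smaller, larger) degrees of their endpoints."""
    counts = Counter()
    for u, nbrs in adj.items():
        for v in nbrs:
            du, dv = len(adj[u]), len(adj[v])
            counts[(min(du, dv), max(du, dv))] += 1
    # Each undirected edge was visited once from each endpoint.
    return {pair: c // 2 for pair, c in counts.items()}


# Hypothetical example: a triangle (1, 2, 3) with a pendant node 4.
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
adj = build_adjacency(edges)
degrees = degree_distribution(adj)
clustering = local_clustering(adj)
joint = joint_degree_distribution(adj)
```

A generator in the spirit described here would be judged by how closely these quantities, computed on its output, match those of the source graph.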