Abstract
Over the last decade, an immense amount of data has become available. From collections
of photos, to genetic data, and to network trac statistics, modern technologies and cheap
storage have made it possible to accumulate huge datasets. But how can we eectively
use all this data? The ever growing sizes of the datasets make it imperative to design new
algorithms capable of sifting through this data with extreme eciency.
A fundamental computational primitive for dealing with massive dataset is the Nearest
Neighbor (NN) problem. In the NN problem, the goal is to preprocess a set of objects, so
that later, given a query object, one can nd eciently the data object most similar to the
query. This problem has a broad set of applications in data processing and analysis. For
instance, it forms the basis of a widely used classication method in machine learning: to
give a label for a new object, nd the most similar labeled object and copy its label. Other
applications include information retrieval, searching image databases, nding duplicate les
and web pages, vector quantization, and many others.
To represent the objects and the similarity measures, one often uses geometric notions.
For example, a black-and-white image may be modeled by a high-dimensional vector, with
one coordinate per pixel, whereas the similarity measure may be the standard Euclidean
distance between the resulting vectors. Many other, more elaborate ways of representing
objects by high-dimensional feature vectors have been studied.
In this thesis, we study the NN problem, as well as other related problems that occur
frequently when dealing with the massive datasets. Our contribution is two-fold: we sig-
nicantly improve the algorithms within the classical approaches to NN, as well as propose
new approaches where the classical ones fail. We focus on several key distances and simi-
larity measures, including the Euclidean distance, string edit distance and the Earth-Mover
Distance (a popular method for comparing images). We also give a number of impossibility
results, pointing out the limits of the NN algorithms.
Users
Please
log in to take part in the discussion (add own reviews or comments).