PhD thesis,

Nearest Neighbor Search: the Old, the New, and the Impossible

.
(2009)

Abstract

Over the last decade, an immense amount of data has become available. From collections of photos, to genetic data, and to network trac statistics, modern technologies and cheap storage have made it possible to accumulate huge datasets. But how can we eectively use all this data? The ever growing sizes of the datasets make it imperative to design new algorithms capable of sifting through this data with extreme eciency. A fundamental computational primitive for dealing with massive dataset is the Nearest Neighbor (NN) problem. In the NN problem, the goal is to preprocess a set of objects, so that later, given a query object, one can nd eciently the data object most similar to the query. This problem has a broad set of applications in data processing and analysis. For instance, it forms the basis of a widely used classication method in machine learning: to give a label for a new object, nd the most similar labeled object and copy its label. Other applications include information retrieval, searching image databases, nding duplicate les and web pages, vector quantization, and many others. To represent the objects and the similarity measures, one often uses geometric notions. For example, a black-and-white image may be modeled by a high-dimensional vector, with one coordinate per pixel, whereas the similarity measure may be the standard Euclidean distance between the resulting vectors. Many other, more elaborate ways of representing objects by high-dimensional feature vectors have been studied. In this thesis, we study the NN problem, as well as other related problems that occur frequently when dealing with the massive datasets. Our contribution is two-fold: we sig- nicantly improve the algorithms within the classical approaches to NN, as well as propose new approaches where the classical ones fail. We focus on several key distances and simi- larity measures, including the Euclidean distance, string edit distance and the Earth-Mover Distance (a popular method for comparing images). We also give a number of impossibility results, pointing out the limits of the NN algorithms.

Tags

Users

  • @ans

Comments and Reviews