PhD thesis,

Nearest Neighbor Search: the Old, the New, and the Impossible

A. Andoni.
(2009)

Abstract

Over the last decade, an immense amount of data has become available. From collections of photos, to genetic data, and to network trac statistics, modern technologies and cheap storage have made it possible to accumulate huge datasets. But how can we eectively use all this data? The ever growing sizes of the datasets make it imperative to design new algorithms capable of sifting through this data with extreme eciency. A fundamental computational primitive for dealing with massive dataset is the Nearest Neighbor (NN) problem. In the NN problem, the goal is to preprocess a set of objects, so that later, given a query object, one can nd eciently the data object most similar to the query. This problem has a broad set of applications in data processing and analysis. For instance, it forms the basis of a widely used classication method in machine learning: to give a label for a new object, nd the most similar labeled object and copy its label. Other applications include information retrieval, searching image databases, nding duplicate les and web pages, vector quantization, and many others. To represent the objects and the similarity measures, one often uses geometric notions. For example, a black-and-white image may be modeled by a high-dimensional vector, with one coordinate per pixel, whereas the similarity measure may be the standard Euclidean distance between the resulting vectors. Many other, more elaborate ways of representing objects by high-dimensional feature vectors have been studied. In this thesis, we study the NN problem, as well as other related problems that occur frequently when dealing with the massive datasets. Our contribution is two-fold: we sig- nicantly improve the algorithms within the classical approaches to NN, as well as propose new approaches where the classical ones fail. We focus on several key distances and simi- larity measures, including the Euclidean distance, string edit distance and the Earth-Mover Distance (a popular method for comparing images). We also give a number of impossibility results, pointing out the limits of the NN algorithms.

BibTeX key: and09
entry type: phdthesis
year: 2009

Users

Comments and Reviewsshow / hide

Please log in to take part in the discussion (add own reviews or comments).

Cite this publication

%0 Thesis %1 and09 %A Andoni, Alexandr %D 2009 %K master nearest neighbor toVerify %T Nearest Neighbor Search: the Old, the New, and the Impossible %X Over the last decade, an immense amount of data has become available. From collections of photos, to genetic data, and to network trac statistics, modern technologies and cheap storage have made it possible to accumulate huge datasets. But how can we eectively use all this data? The ever growing sizes of the datasets make it imperative to design new algorithms capable of sifting through this data with extreme eciency. A fundamental computational primitive for dealing with massive dataset is the Nearest Neighbor (NN) problem. In the NN problem, the goal is to preprocess a set of objects, so that later, given a query object, one can nd eciently the data object most similar to the query. This problem has a broad set of applications in data processing and analysis. For instance, it forms the basis of a widely used classication method in machine learning: to give a label for a new object, nd the most similar labeled object and copy its label. Other applications include information retrieval, searching image databases, nding duplicate les and web pages, vector quantization, and many others. To represent the objects and the similarity measures, one often uses geometric notions. For example, a black-and-white image may be modeled by a high-dimensional vector, with one coordinate per pixel, whereas the similarity measure may be the standard Euclidean distance between the resulting vectors. Many other, more elaborate ways of representing objects by high-dimensional feature vectors have been studied. In this thesis, we study the NN problem, as well as other related problems that occur frequently when dealing with the massive datasets. Our contribution is two-fold: we sig- nicantly improve the algorithms within the classical approaches to NN, as well as propose new approaches where the classical ones fail. We focus on several key distances and simi- larity measures, including the Euclidean distance, string edit distance and the Earth-Mover Distance (a popular method for comparing images). We also give a number of impossibility results, pointing out the limits of the NN algorithms.

@phdthesis{and09, abstract = {Over the last decade, an immense amount of data has become available. From collections of photos, to genetic data, and to network trac statistics, modern technologies and cheap storage have made it possible to accumulate huge datasets. But how can we eectively use all this data? The ever growing sizes of the datasets make it imperative to design new algorithms capable of sifting through this data with extreme eciency. A fundamental computational primitive for dealing with massive dataset is the Nearest Neighbor (NN) problem. In the NN problem, the goal is to preprocess a set of objects, so that later, given a query object, one can nd eciently the data object most similar to the query. This problem has a broad set of applications in data processing and analysis. For instance, it forms the basis of a widely used classication method in machine learning: to give a label for a new object, nd the most similar labeled object and copy its label. Other applications include information retrieval, searching image databases, nding duplicate les and web pages, vector quantization, and many others. To represent the objects and the similarity measures, one often uses geometric notions. For example, a black-and-white image may be modeled by a high-dimensional vector, with one coordinate per pixel, whereas the similarity measure may be the standard Euclidean distance between the resulting vectors. Many other, more elaborate ways of representing objects by high-dimensional feature vectors have been studied. In this thesis, we study the NN problem, as well as other related problems that occur frequently when dealing with the massive datasets. Our contribution is two-fold: we sig- nicantly improve the algorithms within the classical approaches to NN, as well as propose new approaches where the classical ones fail. We focus on several key distances and simi- larity measures, including the Euclidean distance, string edit distance and the Earth-Mover Distance (a popular method for comparing images). We also give a number of impossibility results, pointing out the limits of the NN algorithms. }, added-at = {2010-05-10T21:13:19.000+0200}, author = {Andoni, Alexandr}, biburl = {https://www.bibsonomy.org/bibtex/2242a1704e9a88a436b5fdbfd55a6c87a/ans}, interhash = {be9d4437a63b5c5f85522a00a90626ec}, intrahash = {242a1704e9a88a436b5fdbfd55a6c87a}, keywords = {master nearest neighbor toVerify}, timestamp = {2011-03-22T23:02:16.000+0100}, title = {Nearest Neighbor Search: the Old, the New, and the Impossible }, year = 2009 }

BibSonomy

Nearest Neighbor Search: the Old, the New, and the Impossible

Abstract

Tags

Users

Comments and Reviewsshow / hide

Cite this publication

More citation styles

search on