A. Broder. On the resemblance and containment of documents. Compression and Complexity of Sequences, pages 21--29. Salerno, Italy, IEEE Computer Society Press (June 1997).
Abstract
Given two documents A and B we define two mathematical notions: their
resemblance r(A, B) and their containment c(A, B) that seem to capture
well the informal notions of “roughly the same” and “roughly
contained.” The basic idea is to reduce these issues to set intersection
problems that can be easily evaluated by a process of random sampling
that can be done independently for each document. Furthermore, the
resemblance can be evaluated using a fixed size sample for each
document. This paper discusses the mathematical properties of these
measures and the efficient implementation of the sampling process
using Rabin (1981) fingerprints.
%0 Conference Paper
%1 Broder1997
%A Broder, Andrei Z.
%B Compression and Complexity of Sequences
%C Salerno, Italy
%D 1997
%I IEEE Computer Society Press
%K detection duplicate resemblance
%P 21--29
%T On the resemblance and containment of documents
%U http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.779&rep=rep1&type=pdf
%X Given two documents A and B we define two mathematical notions: their
resemblance r(A, B) and their containment c(A, B) that seem to capture
well the informal notions of “roughly the same” and “roughly
contained.” The basic idea is to reduce these issues to set intersection
problems that can be easily evaluated by a process of random sampling
that can be done independently for each document. Furthermore, the
resemblance can be evaluated using a fixed size sample for each
document. This paper discusses the mathematical properties of these
measures and the efficient implementation of the sampling process
using Rabin (1981) fingerprints.
@inproceedings{Broder1997,
abstract = {Given two documents A and B we define two mathematical notions: their
resemblance r(A, B) and their containment c(A, B) that seem to capture
well the informal notions of “roughly the same” and “roughly
contained.” The basic idea is to reduce these issues to set intersection
problems that can be easily evaluated by a process of random sampling
that can be done independently for each document. Furthermore, the
resemblance can be evaluated using a fixed size sample for each
document. This paper discusses the mathematical properties of these
measures and the efficient implementation of the sampling process
using Rabin (1981) fingerprints.},
added-at = {2011-07-07T11:07:43.000+0200},
address = {Salerno, Italy},
author = {Broder, Andrei Z.},
biburl = {https://www.bibsonomy.org/bibtex/2e8d7e47dafc145c54846bb69e1c1be39/stroeh},
booktitle = {Compression and Complexity of Sequences},
citeulike-article-id = {562668},
description = {Not previously uploaded},
interhash = {3e9b05638c537f23a276ef4e09d4b9d4},
intrahash = {e8d7e47dafc145c54846bb69e1c1be39},
keywords = {detection duplicate resemblance},
month = {June},
pages = {21--29},
priority = {3},
publisher = {IEEE Computer Society Press},
timestamp = {2011-07-07T11:07:43.000+0200},
title = {On the resemblance and containment of documents},
url = {http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.779&rep=rep1&type=pdf},
year = 1997
}