Abstract
We survey a new area of parameter-free similarity distance measures useful in
data-mining, pattern recognition, learning and automatic semantics extraction.
Given a family of distances on a set of objects, a distance is universal up to
a certain precision for that family if it minorizes every distance in the
family between every two objects in the set, up to the stated precision (we do
not require the universal distance to be an element of the family). We consider
similarity distances for two types of objects: literal objects that as such
contain all of their meaning, like genomes or books, and names for objects.
The latter may have
literal embodiments like the first type, but may also be abstract like ``red''
or ``christianity.'' For the first type we consider a family of computable
distance measures corresponding to parameters expressing similarity according
to particular features between pairs of literal objects. For the second type we
consider similarity distances generated by web users corresponding to
particular semantic relations between the (names for the) designated objects.
For both families we give universal similarity distance measures, incorporating
all particular distance measures in the family. In the first case the universal
distance is based on compression and in the second case it is based on Google
page counts related to search terms. In both cases experiments on a massive
scale give evidence of the viability of the approaches.
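
As a rough illustration of the two universal distances mentioned above, the following Python sketch computes a compression-based distance and a page-count-based distance. The formulas are the normalized compression distance and normalized Google distance as they are commonly stated in the literature; the abstract itself does not spell them out, so they are an assumption here. The function names, the use of zlib as the compressor, and the arguments fx, fy, fxy and n (page counts for each term, for both terms together, and the total number of indexed pages) are hypothetical choices for illustration only.

```python
import math
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (zlib is a stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Compression-based distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Page-count-based distance from hypothetical counts.

    fx, fy: pages containing each search term; fxy: pages containing both;
    n: assumed total number of indexed pages. All counts must be positive.
    """
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))
```

With a real compressor the first value is only an approximation, so it need not be exactly zero for identical inputs; the second value is zero when the two terms always occur together and grows as their co-occurrence count shrinks.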