Abstract
Open category detection is the problem of detecting älien" test instances
that belong to categories or classes that were not present in the training
data. In many applications, reliably detecting such aliens is central to
ensuring the safety and accuracy of test set predictions. Unfortunately, there
are no algorithms that provide theoretical guarantees on their ability to
detect aliens under general assumptions. Further, while there are algorithms
for open category detection, there are few empirical results that directly
report alien detection rates. Thus, there are significant theoretical and
empirical gaps in our understanding of open category detection. In this paper,
we take a step toward addressing this gap by studying a simple, but
practically-relevant variant of open category detection. In our setting, we are
provided with a "clean" training set that contains only the target categories
of interest and an unlabeled "contaminated" training set that contains a
fraction $\alpha$ of alien examples. Under the assumption that we know an upper
bound on $\alpha$, we develop an algorithm with PAC-style guarantees on the
alien detection rate, while aiming to minimize false alarms. Empirical results
on synthetic and standard benchmark datasets demonstrate the regimes in which
the algorithm can be effective and provide a baseline for further advancements.
Users
Please
log in to take part in the discussion (add own reviews or comments).