Outlier detection is a fundamental task in data mining and has many
applications including detecting errors in databases. While there has been
extensive prior work on methods for outlier detection, modern datasets often
have sizes that are beyond the ability of commonly used methods to process the
data within a reasonable time. To overcome this issue, outlier detection
methods can be trained over samples of the full-sized dataset. However, it is
not clear how a model trained on a sample compares with one trained on the
entire dataset. In this paper, we introduce the notion of resilience to
sampling for outlier detection methods. Orthogonal to traditional performance
metrics such as precision/recall, resilience represents the extent to which the
outliers detected by a method applied to samples from a sampling scheme match
those when applied to the whole dataset. We propose a novel approach for
estimating the resilience to sampling of both individual outlier methods and
their ensembles. We performed an extensive experimental study on synthetic and
real-world datasets where we study seven diverse and representative outlier
detection methods, compare results obtained from samples versus those obtained
from the whole datasets and evaluate the accuracy of our resilience estimates.
We observed that the methods are not equally resilient to a given sampling
scheme and it is often the case that careful joint selection of both the
sampling scheme and the outlier detection method is necessary. It is our hope
that the paper initiates research on designing outlier detection algorithms
that are resilient to sampling.
Description
[1907.13276v1] Are Outlier Detection Methods Resilient to Sampling?
%0 Journal Article
%1 bertiequille2019outlier
%A Berti-Equille, Laure
%A Loh, Ji Meng
%A Thirumuruganathan, Saravanan
%D 2019
%K anomaly-detection outliers sampling
%T Are Outlier Detection Methods Resilient to Sampling?
%U http://arxiv.org/abs/1907.13276
@article{bertiequille2019outlier,
abstract = {Outlier detection is a fundamental task in data mining and has many
applications including detecting errors in databases. While there has been
extensive prior work on methods for outlier detection, modern datasets often
have sizes that are beyond the ability of commonly used methods to process the
data within a reasonable time. To overcome this issue, outlier detection
methods can be trained over samples of the full-sized dataset. However, it is
not clear how a model trained on a sample compares with one trained on the
entire dataset. In this paper, we introduce the notion of resilience to
sampling for outlier detection methods. Orthogonal to traditional performance
metrics such as precision/recall, resilience represents the extent to which the
outliers detected by a method applied to samples from a sampling scheme match
those when applied to the whole dataset. We propose a novel approach for
estimating the resilience to sampling of both individual outlier methods and
their ensembles. We performed an extensive experimental study on synthetic and
real-world datasets where we study seven diverse and representative outlier
detection methods, compare results obtained from samples versus those obtained
from the whole datasets and evaluate the accuracy of our resilience estimates.
We observed that the methods are not equally resilient to a given sampling
scheme and it is often the case that careful joint selection of both the
sampling scheme and the outlier detection method is necessary. It is our hope
that the paper initiates research on designing outlier detection algorithms
that are resilient to sampling.},
added-at = {2019-08-22T20:46:24.000+0200},
author = {Berti-Equille, Laure and Loh, Ji Meng and Thirumuruganathan, Saravanan},
biburl = {https://www.bibsonomy.org/bibtex/225cfaef3b0d4d8f6409e8db13d10396e/kirk86},
description = {[1907.13276v1] Are Outlier Detection Methods Resilient to Sampling?},
interhash = {f8f520b7ce626c052dc8c285a096d5b7},
intrahash = {25cfaef3b0d4d8f6409e8db13d10396e},
keywords = {anomaly-detection outliers sampling},
note = {cite arxiv:1907.13276; Comment: 18 pages},
timestamp = {2019-08-22T20:46:24.000+0200},
title = {Are Outlier Detection Methods Resilient to Sampling?},
url = {http://arxiv.org/abs/1907.13276},
year = 2019
}