We algorithmically identify label errors in the test sets of 10 of the most
commonly used computer vision, natural language, and audio datasets, and
subsequently study the potential for these label errors to affect benchmark
results. Errors in test sets are numerous and widespread: we estimate an
average of 3.4% errors across the 10 datasets; for example, 2916 label errors
comprise 6% of the ImageNet validation set. Putative label errors are found
using confident learning and then human-validated via crowdsourcing (54% of
the algorithmically flagged candidates are indeed erroneously labeled).
Surprisingly, we find that lower-capacity models may be practically more
useful than higher-capacity models on real-world datasets with high
proportions of erroneously labeled data. For example, on ImageNet with
corrected labels, ResNet-18 outperforms ResNet-50 if the prevalence of
originally mislabeled test examples increases by just 6%; on CIFAR-10 with
corrected labels, VGG-11 outperforms VGG-19 if that prevalence increases by
5%. Traditionally, ML practitioners choose which model to deploy based on
test accuracy -- our findings advise caution here, suggesting that judging
models on correctly labeled test sets may be more useful, especially for
noisy real-world datasets.
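
How confident learning flags these candidates: below is a minimal NumPy
sketch of its core thresholding step. This is illustrative only, not the
paper's exact implementation (the authors' open-source cleanlab library
provides that); the function name flag_label_issues is hypothetical, and the
sketch assumes out-of-sample predicted probabilities (e.g. from
cross-validation) are already available and that every class appears at
least once among the given labels.

import numpy as np

def flag_label_issues(labels, pred_probs):
    """Return indices of examples whose given label looks erroneous.

    labels     : (n,) int array of the given (possibly noisy) labels
    pred_probs : (n, k) out-of-sample predicted class probabilities
    """
    n, k = pred_probs.shape
    # Per-class threshold: the model's average self-confidence over the
    # examples currently labeled as that class.
    thresholds = np.array([pred_probs[labels == c, c].mean() for c in range(k)])
    flagged = []
    for i in range(n):
        # Classes whose predicted probability clears that class's threshold.
        qualifying = np.flatnonzero(pred_probs[i] >= thresholds)
        if qualifying.size == 0:
            continue
        # The confidently predicted class is the most probable qualifier.
        j = qualifying[np.argmax(pred_probs[i, qualifying])]
        if j != labels[i]:
            flagged.append(i)  # model confidently contradicts the given label
    return np.asarray(flagged, dtype=int)

With predicted probabilities from k-fold cross-validation,
flag_label_issues(labels, pred_probs) yields candidate indices that would
then go to crowdworkers for validation, mirroring the pipeline described
above.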
Description
Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks
%0 Generic
%1 northcutt2021pervasive
%A Northcutt, Curtis G.
%A Athalye, Anish
%A Mueller, Jonas
%D 2021
%K neural-network
%T Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks
%U http://arxiv.org/abs/2103.14749
@misc{northcutt2021pervasive,
author = {Northcutt, Curtis G. and Athalye, Anish and Mueller, Jonas},
keywords = {neural-network},
note = {cite arxiv:2103.14749},
title = {Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks},
url = {http://arxiv.org/abs/2103.14749},
year = 2021
}