We algorithmically identify label errors in the test sets of 10 of the most
commonly used computer vision, natural language, and audio datasets, and
subsequently study the potential for these label errors to affect benchmark
results. Errors in test sets are numerous and widespread: we estimate an
average of 3.4% errors across the 10 datasets; for example, 2916 label errors
comprise 6% of the ImageNet validation set. Putative label errors are found
using confident learning and then human-validated via crowdsourcing (54% of
the algorithmically flagged candidates are indeed erroneously labeled).
Surprisingly, we find that lower-capacity models may be practically more
useful than higher-capacity models on real-world datasets with high
proportions of erroneously labeled data. For example, on ImageNet with
corrected labels, ResNet-18 outperforms ResNet-50 if the prevalence of
originally mislabeled test examples increases by just 6%; on CIFAR-10 with
corrected labels, VGG-11 outperforms VGG-19 if that prevalence increases by
5%. Traditionally, ML practitioners choose which model to deploy based on
test accuracy -- our findings advise caution here, suggesting that judging
models on correctly labeled test sets may be more useful, especially for
noisy real-world datasets.
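
How confident learning flags these candidates: below is a minimal NumPy
sketch of its core thresholding step. This is illustrative only, not the
paper's exact implementation (the authors' open-source cleanlab library
provides that); the function name flag_label_issues is hypothetical, and the
sketch assumes out-of-sample predicted probabilities (e.g. from
cross-validation) are already available and that every class appears at
least once among the given labels.

import numpy as np

def flag_label_issues(labels, pred_probs):
    """Return indices of examples whose given label looks erroneous.

    labels     : (n,) int array of the given (possibly noisy) labels
    pred_probs : (n, k) out-of-sample predicted class probabilities
    """
    n, k = pred_probs.shape
    # Per-class threshold: the model's average self-confidence over the
    # examples currently labeled as that class.
    thresholds = np.array([pred_probs[labels == c, c].mean() for c in range(k)])
    flagged = []
    for i in range(n):
        # Classes whose predicted probability clears that class's threshold.
        qualifying = np.flatnonzero(pred_probs[i] >= thresholds)
        if qualifying.size == 0:
            continue
        # The confidently predicted class is the most probable qualifier.
        j = qualifying[np.argmax(pred_probs[i, qualifying])]
        if j != labels[i]:
            flagged.append(i)  # model confidently contradicts the given label
    return np.asarray(flagged, dtype=int)

With predicted probabilities from k-fold cross-validation,
flag_label_issues(labels, pred_probs) yields candidate indices that would
then go to crowdworkers for validation, mirroring the pipeline described
above.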
Description
Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks
%0 Generic
%1 northcutt2021pervasive
%A Northcutt, Curtis G.
%A Athalye, Anish
%A Mueller, Jonas
%D 2021
%K neural-network
%T Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks
%U http://arxiv.org/abs/2103.14749
@misc{northcutt2021pervasive,
author = {Northcutt, Curtis G. and Athalye, Anish and Mueller, Jonas},
keywords = {neural-network},
note = {cite arxiv:2103.14749},
title = {Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks},
url = {http://arxiv.org/abs/2103.14749},
year = 2021
}