Standard test sets for supervised learning evaluate in-distribution
generalization. Unfortunately, when a dataset has systematic gaps (e.g.,
annotation artifacts), these evaluations are misleading: a model can learn
simple decision rules that perform well on the test set but do not capture a
dataset's intended capabilities. We propose a new annotation paradigm for NLP
that helps to close systematic gaps in the test data. In particular, after a
dataset is constructed, we recommend that the dataset authors manually perturb
the test instances in small but meaningful ways that (typically) change the
gold label, creating contrast sets. Contrast sets provide a local view of a
model's decision boundary, which can be used to more accurately evaluate a
model's true linguistic capabilities. We demonstrate the efficacy of contrast
sets by creating them for 10 diverse NLP datasets (e.g., DROP reading
comprehension, UD parsing, IMDb sentiment analysis). Although our contrast sets
are not explicitly adversarial, model performance is significantly lower on
them than on the original test sets, by up to 25% in some cases. We release our
contrast sets as new evaluation benchmarks and encourage future dataset
construction efforts to follow similar annotation processes.
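The abstract describes the evaluation procedure only in prose. As a rough illustration, here is a minimal Python sketch of scoring a model on original test instances and on their perturbed contrast examples, and of counting how often a model answers an original instance and all of its perturbations correctly. The `predict` callable, the paired data layout, and the "consistency" name are assumptions for illustration, not the authors' released evaluation code.

```python
# Minimal sketch of contrast-set evaluation (illustrative, not the authors' code).
# Assumes a predict(text) -> label function and (original, contrast set) pairs;
# both the function and the data layout here are hypothetical.

from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input text, gold label)

def evaluate_contrast_sets(
    predict: Callable[[str], str],
    pairs: List[Tuple[Example, List[Example]]],
) -> dict:
    """Compare accuracy on original test instances vs. their contrast examples."""
    orig_correct = 0
    contrast_correct = 0
    contrast_total = 0
    consistent = 0  # original and every one of its perturbations answered correctly

    for (orig_x, orig_y), contrast_examples in pairs:
        orig_ok = predict(orig_x) == orig_y
        orig_correct += orig_ok

        contrast_oks = [predict(x) == y for x, y in contrast_examples]
        contrast_correct += sum(contrast_oks)
        contrast_total += len(contrast_oks)

        consistent += orig_ok and all(contrast_oks)

    n = len(pairs)
    return {
        "original_accuracy": orig_correct / n,
        "contrast_accuracy": contrast_correct / max(contrast_total, 1),
        "consistency": consistent / n,
    }
```

Reporting original accuracy, contrast accuracy, and the per-set consistency side by side makes the gap the abstract describes (lower performance on contrast sets than on the original test set) directly visible.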
@misc{gardner2020evaluating,
author = {Gardner, Matt and Artzi, Yoav and Basmova, Victoria and Berant, Jonathan and Bogin, Ben and Chen, Sihao and Dasigi, Pradeep and Dua, Dheeru and Elazar, Yanai and Gottumukkala, Ananth and Gupta, Nitish and Hajishirzi, Hanna and Ilharco, Gabriel and Khashabi, Daniel and Lin, Kevin and Liu, Jiangming and Liu, Nelson F. and Mulcaire, Phoebe and Ning, Qiang and Singh, Sameer and Smith, Noah A. and Subramanian, Sanjay and Tsarfaty, Reut and Wallace, Eric and Zhang, Ally and Zhou, Ben},
note = {arXiv:2004.02709},
title = {Evaluating Models' Local Decision Boundaries via Contrast Sets},
url = {http://arxiv.org/abs/2004.02709},
year = 2020
}