S. Aroca-Ouellette and F. Rudzicz. On Losses for Modern Language Models. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4970--4981. Online, Association for Computational Linguistics, (November 2020)
Abstract
BERT set many state-of-the-art results over varied NLU benchmarks by pre-training over two tasks: masked language modelling (MLM) and next sentence prediction (NSP), the latter of which has been highly criticized. In this paper, we 1) clarify NSP's effect on BERT pre-training, 2) explore fourteen possible auxiliary pre-training tasks, of which seven are novel to modern language models, and 3) investigate different ways to include multiple tasks into pre-training. We show that NSP is detrimental to training due to its context splitting and shallow semantic signal. We also identify six auxiliary pre-training tasks -- sentence ordering, adjacent sentence prediction, TF prediction, TF-IDF prediction, a FastSent variant, and a Quick Thoughts variant -- that outperform a pure MLM baseline. Finally, we demonstrate that using multiple tasks in a multi-task pre-training framework provides better results than using any single auxiliary task. Using these methods, we outperform BERT-Base on the GLUE benchmark using fewer than a quarter of the training tokens.
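The abstract's central technical idea, multi-task pre-training that pairs MLM with an auxiliary objective such as TF-IDF prediction, can be sketched as follows. This is a minimal PyTorch illustration under stated assumptions: the per-token scalar TF-IDF target, the head shapes, and the equal task weighting are illustrative choices, not the authors' exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskPretrainingLoss(nn.Module):
    """Combine an MLM loss with an auxiliary TF-IDF regression loss.

    Head shapes and the equal task weighting are assumptions for
    illustration, not the paper's reported configuration.
    """

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.mlm_head = nn.Linear(hidden_size, vocab_size)  # token logits
        self.tfidf_head = nn.Linear(hidden_size, 1)         # scalar TF-IDF per token

    def forward(self, hidden_states, mlm_labels, tfidf_targets):
        # hidden_states: (batch, seq_len, hidden_size) from the encoder
        mlm_logits = self.mlm_head(hidden_states)
        mlm_loss = F.cross_entropy(
            mlm_logits.view(-1, mlm_logits.size(-1)),
            mlm_labels.view(-1),
            ignore_index=-100,  # convention: unmasked positions are ignored
        )
        tfidf_pred = self.tfidf_head(hidden_states).squeeze(-1)
        tfidf_loss = F.mse_loss(tfidf_pred, tfidf_targets)
        # Equal weighting is an assumption; the paper investigates several
        # ways of combining multiple tasks during pre-training.
        return mlm_loss + tfidf_loss

# Example with dummy tensors (batch=2, seq_len=8, hidden=16, vocab=100):
loss_fn = MultiTaskPretrainingLoss(hidden_size=16, vocab_size=100)
hidden = torch.randn(2, 8, 16)
mlm_labels = torch.full((2, 8), -100, dtype=torch.long)
mlm_labels[:, 3] = 42  # pretend one position per sequence was masked
tfidf_targets = torch.rand(2, 8)
loss = loss_fn(hidden, mlm_labels, tfidf_targets)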
@inproceedings{aroca-ouellette-rudzicz-2020-losses,
address = {Online},
author = {Aroca-Ouellette, Stephane and Rudzicz, Frank},
booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
keywords = {bert emnlp2020 pretraining tasks},
month = nov,
pages = {4970--4981},
publisher = {Association for Computational Linguistics},
title = {On Losses for Modern Language Models},
url = {https://www.aclweb.org/anthology/2020.emnlp-main.403},
year = 2020
}