Abstract
Transformer models, notably large language models (LLMs), have the remarkable
ability to perform in-context learning (ICL): learning new tasks when
prompted with unseen input-output examples, without any explicit model training.
In this work, we study how effectively transformers can bridge from their
pretraining data mixture, composed of multiple distinct task families, to
identifying and learning new tasks in-context that lie both inside and outside the
pretraining distribution. Building on previous work, we investigate this
question in a controlled setting, where we study transformer models trained on
sequences of $(x, f(x))$ pairs rather than natural language. Our empirical
results show that transformers demonstrate near-optimal unsupervised model selection:
they can first identify distinct task families in-context and then learn within them
in-context, provided those task families are well represented in their pretraining data.
However, when presented with tasks or functions that lie outside the domain of their
pretraining data, we demonstrate various failure modes of transformers and degraded
generalization, even on simple extrapolation tasks. Together, our results highlight
that the impressive ICL abilities of high-capacity sequence models may be more closely
tied to the coverage of their pretraining data mixtures than to inductive biases that
confer fundamental generalization capabilities.
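
To make the controlled setting concrete, the following is a minimal sketch of how one pretraining sequence of $(x, f(x))$ pairs might be constructed from a mixture of task families. The specific families shown (dense and sparse linear functions), the sampling distributions, and all hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: build one pretraining sequence of (x, f(x)) pairs by first
# sampling a task family from a mixture, then sampling a function f from that
# family, then evaluating f on random inputs. Families and hyperparameters
# here are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)

def sample_dense_linear(dim):
    # f(x) = w . x with a dense Gaussian weight vector
    w = rng.normal(size=dim)
    return lambda x: x @ w

def sample_sparse_linear(dim, k=3):
    # f(x) = w . x with only k nonzero weights
    w = np.zeros(dim)
    idx = rng.choice(dim, size=k, replace=False)
    w[idx] = rng.normal(size=k)
    return lambda x: x @ w

TASK_FAMILIES = [sample_dense_linear, sample_sparse_linear]

def make_sequence(dim=8, n_pairs=32):
    """Sample a family, then a function f, then a sequence of (x, f(x)) pairs."""
    family = TASK_FAMILIES[rng.integers(len(TASK_FAMILIES))]
    f = family(dim)
    xs = rng.normal(size=(n_pairs, dim))
    ys = np.array([f(x) for x in xs])
    # A transformer trained on such sequences predicts each y_i from the
    # preceding (x, y) pairs and the current x_i, which is the ICL-style
    # sequence-modeling objective studied in this line of work.
    return xs, ys

xs, ys = make_sequence()
print(xs.shape, ys.shape)  # (32, 8) (32,)
```

In this framing, "model selection" corresponds to the trained transformer inferring, from the in-context pairs alone, which task family generated the sequence before fitting the function within that family.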