Softmax Bottleneck Makes Language Models Unable to Represent Multi-mode Word Distributions
H. Chang and A. McCallum. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8048--8073, Dublin, Ireland. Association for Computational Linguistics, May 2022
DOI: 10.18653/v1/2022.acl-long.554
Abstract
Neural language models (LMs) such as GPT-2 estimate the probability distribution over the next word by a softmax over the vocabulary. The softmax layer produces the distribution based on the dot products of a single hidden state and the embeddings of words in the vocabulary. However, we discover that this single hidden state cannot produce all probability distributions regardless of the LM size or training data size because the single hidden state embedding cannot be close to the embeddings of all the possible next words simultaneously when there are other interfering word embeddings between them. In this work, we demonstrate the importance of this limitation both theoretically and practically. Our work not only deepens our understanding of softmax bottleneck and mixture of softmax (MoS) but also inspires us to propose multi-facet softmax (MFS) to address the limitations of MoS. Extensive empirical analyses confirm our findings and show that against MoS, the proposed MFS achieves two-fold improvements in the perplexity of GPT-2 and BERT.
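The mechanism the abstract describes, a softmax over the dot products of one hidden state with every word embedding, and the mixture-of-softmax (MoS) alternative, can be sketched in a few lines of plain Python. This is a minimal illustrative sketch, not the paper's implementation; the toy embeddings and facet vectors are invented for illustration. Note how an interfering embedding placed between two desired next words (here, at the origin) blocks a single hidden state, while a two-facet mixture puts high probability on both distant words at once:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def single_softmax_probs(hidden, word_embs):
    # Standard LM output layer: dot one hidden state with every word embedding,
    # then softmax over the vocabulary.
    logits = [sum(h * w for h, w in zip(hidden, emb)) for emb in word_embs]
    return softmax(logits)

def mos_probs(hiddens, weights, word_embs):
    # Mixture of softmaxes (MoS): a weighted sum of K softmax distributions,
    # each produced by its own facet hidden state. Unlike a single softmax,
    # the mixture can concentrate mass on several mutually distant words.
    mix = [0.0] * len(word_embs)
    for w, h in zip(weights, hiddens):
        p = single_softmax_probs(h, word_embs)
        mix = [m + w * pi for m, pi in zip(mix, p)]
    return mix

# Toy vocabulary of 3 words; word 2 sits between words 0 and 1 and interferes:
# any single hidden state that favors both word 0 and word 1 equally collapses
# to the uniform direction and gives word 2 comparable probability.
embs = [[1.0, 0.0], [-1.0, 0.0], [0.0, 0.0]]
mix = mos_probs([[5.0, 0.0], [-5.0, 0.0]], [0.5, 0.5], embs)
```

With the two facets above, the mixture assigns roughly 0.50 probability each to the two distant words and under 0.01 to the interfering middle word, which no single hidden state can do with these embeddings.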
@inproceedings{chang-mccallum-2022-softmax,
abstract = {Neural language models (LMs) such as GPT-2 estimate the probability distribution over the next word by a softmax over the vocabulary. The softmax layer produces the distribution based on the dot products of a single hidden state and the embeddings of words in the vocabulary. However, we discover that this single hidden state cannot produce all probability distributions regardless of the LM size or training data size because the single hidden state embedding cannot be close to the embeddings of all the possible next words simultaneously when there are other interfering word embeddings between them. In this work, we demonstrate the importance of this limitation both theoretically and practically. Our work not only deepens our understanding of softmax bottleneck and mixture of softmax (MoS) but also inspires us to propose multi-facet softmax (MFS) to address the limitations of MoS. Extensive empirical analyses confirm our findings and show that against MoS, the proposed MFS achieves two-fold improvements in the perplexity of GPT-2 and BERT.},
address = {Dublin, Ireland},
author = {Chang, Haw-Shiuan and McCallum, Andrew},
booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
doi = {10.18653/v1/2022.acl-long.554},
editor = {Muresan, Smaranda and Nakov, Preslav and Villavicencio, Aline},
keywords = {llm nlp reading softmax},
month = may,
pages = {8048--8073},
publisher = {Association for Computational Linguistics},
title = {Softmax Bottleneck Makes Language Models Unable to Represent Multi-mode Word Distributions},
url = {https://aclanthology.org/2022.acl-long.554},
year = 2022
}