Retrieval-Augmented Multimodal Language Modeling

M. Yasunaga, A. Aghajanyan, W. Shi, R. James, J. Leskovec, P. Liang, M. Lewis, L. Zettlemoyer, and W. Yih. (2022). arXiv:2211.12561. Comment: Published at ICML 2023. Blog post available at https://cs.stanford.edu/~myasu/blog/racm3/.

Abstract

Recent multimodal models such as DALL-E and CM3 have achieved remarkable progress in text-to-image and image-to-text generation. However, these models store all learned knowledge (e.g., the appearance of the Eiffel Tower) in the model parameters, requiring increasingly larger models and training data to capture more knowledge. To integrate knowledge in a more scalable and modular way, we propose a retrieval-augmented multimodal model, which enables a base multimodal model (generator) to refer to relevant text and images fetched by a retriever from external memory (e.g., documents on the web). Specifically, for the retriever, we use a pretrained CLIP, and for the generator, we train a CM3 Transformer on the LAION dataset. Our resulting model, named Retrieval-Augmented CM3 (RA-CM3), is the first multimodal model that can retrieve and generate both text and images. We show that RA-CM3 significantly outperforms baseline multimodal models such as DALL-E and CM3 on both image and caption generation tasks (12 FID and 17 CIDEr improvements on MS-COCO), while requiring much less compute for training (<30% of DALL-E). Moreover, we show that RA-CM3 exhibits novel capabilities, such as faithful image generation and multimodal in-context learning (e.g., image generation from demonstrations).
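
The retrieve-then-generate flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy three-dimensional embeddings stand in for the output of a pretrained encoder such as CLIP, and the names `MemoryItem`, `retrieve`, and `build_generator_input` are hypothetical, introduced only for illustration. It shows a retriever ranking external-memory items by cosine similarity and the top results being prepended to the prompt that a generator would consume.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryItem:
    """One entry in the external memory: a caption plus a toy embedding."""
    caption: str
    embedding: list  # stand-in for an encoder output (assumption, not real CLIP)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding, memory, k=2):
    """Return the k memory items most similar to the query embedding."""
    ranked = sorted(
        memory,
        key=lambda item: cosine(query_embedding, item.embedding),
        reverse=True,
    )
    return ranked[:k]


def build_generator_input(prompt, retrieved):
    """Prepend retrieved items to the prompt so a generator can refer to them."""
    return [item.caption for item in retrieved] + [prompt]


# Toy external memory; a real system would hold image and text documents.
memory = [
    MemoryItem("photo of the Eiffel Tower at night", [0.9, 0.1, 0.0]),
    MemoryItem("a bowl of ramen", [0.0, 0.2, 0.9]),
    MemoryItem("Paris skyline with a tower", [0.8, 0.3, 0.1]),
]

query_embedding = [1.0, 0.2, 0.0]  # hypothetical embedding of the prompt
retrieved = retrieve(query_embedding, memory, k=2)
generator_input = build_generator_input("the Eiffel Tower in autumn", retrieved)
print(generator_input)
```

Keeping the memory outside the generator is what makes the knowledge modular in this sketch: items can be added to or removed from `memory` without touching any model parameters.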

Description

Retrieval-Augmented Multimodal Language Modeling

Links and resources

BibTeX key: yasunaga2022retrievalaugmented
entry type: misc
year: 2022
url: http://arxiv.org/abs/2211.12561
note: cite arxiv:2211.12561. Comment: Published at ICML 2023. Blog post available at https://cs.stanford.edu/~myasu/blog/racm3/

Cite this publication

@misc{yasunaga2022retrievalaugmented,
  abstract = {Recent multimodal models such as DALL-E and CM3 have achieved remarkable progress in text-to-image and image-to-text generation. However, these models store all learned knowledge (e.g., the appearance of the Eiffel Tower) in the model parameters, requiring increasingly larger models and training data to capture more knowledge. To integrate knowledge in a more scalable and modular way, we propose a retrieval-augmented multimodal model, which enables a base multimodal model (generator) to refer to relevant text and images fetched by a retriever from external memory (e.g., documents on the web). Specifically, for the retriever, we use a pretrained CLIP, and for the generator, we train a CM3 Transformer on the LAION dataset. Our resulting model, named Retrieval-Augmented CM3 (RA-CM3), is the first multimodal model that can retrieve and generate both text and images. We show that RA-CM3 significantly outperforms baseline multimodal models such as DALL-E and CM3 on both image and caption generation tasks (12 FID and 17 CIDEr improvements on MS-COCO), while requiring much less compute for training (<30% of DALL-E). Moreover, we show that RA-CM3 exhibits novel capabilities, such as faithful image generation and multimodal in-context learning (e.g., image generation from demonstrations).},
  added-at = {2023-08-17T14:59:40.000+0200},
  author = {Yasunaga, Michihiro and Aghajanyan, Armen and Shi, Weijia and James, Rich and Leskovec, Jure and Liang, Percy and Lewis, Mike and Zettlemoyer, Luke and Yih, Wen-tau},
  biburl = {https://www.bibsonomy.org/bibtex/2a712665aea8f46197c9d290b338bc7cc/lisa-ee},
  description = {Retrieval-Augmented Multimodal Language Modeling},
  interhash = {0001ace53e05bdcb6c1367ca436851d5},
  intrahash = {a712665aea8f46197c9d290b338bc7cc},
  keywords = {llm retrieval},
  note = {cite arxiv:2211.12561. Comment: Published at ICML 2023. Blog post available at https://cs.stanford.edu/~myasu/blog/racm3/},
  timestamp = {2023-08-17T14:59:40.000+0200},
  title = {Retrieval-Augmented Multimodal Language Modeling},
  url = {http://arxiv.org/abs/2211.12561},
  year = 2022
}
