Misc

Reformer: The Efficient Transformer

Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya.
(2020). arXiv:2001.04451. Comment: ICLR 2020.

Abstract

Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O($L^2$) to O($L\log L$), where $L$ is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of $N$ times, where $N$ is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
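To make the two ideas in the abstract concrete, here is a minimal NumPy sketch; the function names, shapes, and toy residual functions are illustrative assumptions, not taken from the paper's Trax implementation. The first function implements the angular locality-sensitive hashing scheme used to bucket shared query/key vectors, and the second pair shows why a reversible residual block lets inputs be recomputed from outputs instead of storing activations per layer.

    import numpy as np

    def lsh_bucket_ids(vectors, n_buckets, rng):
        # Angular LSH: project onto a random rotation and take the argmax
        # over [xR; -xR], so nearby vectors tend to land in the same bucket.
        d = vectors.shape[-1]
        R = rng.standard_normal((d, n_buckets // 2))
        projected = vectors @ R  # shape: (seq_len, n_buckets // 2)
        return np.argmax(np.concatenate([projected, -projected], axis=-1), axis=-1)

    def reversible_forward(x1, x2, F, G):
        # One reversible residual block: the outputs alone determine the inputs,
        # so intermediate activations need not be stored for backpropagation.
        y1 = x1 + F(x2)
        y2 = x2 + G(y1)
        return y1, y2

    def reversible_inverse(y1, y2, F, G):
        # Recompute the inputs from the outputs during the backward pass.
        x2 = y2 - G(y1)
        x1 = y1 - F(x2)
        return x1, x2

    rng = np.random.default_rng(0)

    # Bucket 1024 shared query/key vectors into 32 hash buckets; attention is
    # then restricted to (sorted, chunked) vectors within the same bucket.
    qk = rng.standard_normal((1024, 64))
    buckets = lsh_bucket_ids(qk, n_buckets=32, rng=rng)

    # Round trip through a reversible block with toy residual functions.
    F = G = np.tanh
    x1, x2 = rng.standard_normal((2, 16, 64))
    y1, y2 = reversible_forward(x1, x2, F, G)
    r1, r2 = reversible_inverse(y1, y2, F, G)
    assert np.allclose(x1, r1) and np.allclose(x2, r2)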

Tags

Users

  • @albinzehe
  • @jonaskaiser
  • @stdiff
  • @nosebrain
  • @mcreinhardt
  • @festplatte

Comments and Reviews

  • @jonaskaiser
    4 years ago (last updated 4 years ago)
    Used in the write-up, since the Reformer extends the range of possible applications of the Transformer and thus fits well with the question of how broadly the attention mechanism can be applied.