Abstract
Transformers do not scale very well to long sequence lengths, largely because
of the quadratic complexity of self-attention. In recent months, a wide
spectrum of efficient, fast Transformers has been proposed to tackle this
problem, more often than not claiming superior or comparable model quality to
vanilla Transformer models. To date, there is no well-established consensus on
how to evaluate this class of models. Moreover, inconsistent benchmarking on a
broad set of tasks and datasets makes it difficult to assess relative model
quality amongst many models. This paper proposes a systematic and unified
benchmark, Long Range Arena (LRA), specifically focused on evaluating model quality under
long-context scenarios. Our benchmark is a suite of tasks consisting of
sequences ranging from 1K to 16K tokens, encompassing a wide range of data
types and modalities such as text, natural and synthetic images, and
mathematical expressions, requiring similarity, structural, and visual-spatial
reasoning. We
systematically evaluate ten well-established long-range Transformer models
(Reformers, Linformers, Linear Transformers, Sinkhorn Transformers, Performers,
Synthesizers, Sparse Transformers, and Longformers) on our newly proposed
benchmark suite. LRA paves the way toward a better understanding of this class
of efficient Transformer models, facilitates more research in this direction, and
presents new challenging tasks to tackle. Our benchmark code will be released
at https://github.com/google-research/long-range-arena.