Recently, it has been argued that encoder-decoder models can be made more
interpretable by replacing the softmax function in the attention with its
sparse variants. In this work, we introduce a novel, simple method for
achieving sparsity in attention: we replace the softmax activation with a ReLU,
and show that sparsity naturally emerges from such a formulation. Training
stability is achieved with layer normalization with either a specialized
initialization or an additional gating function. Our model, which we call
Rectified Linear Attention (ReLA), is easy to implement and more efficient than
previously proposed sparse attention mechanisms. We apply ReLA to the
Transformer and conduct experiments on five machine translation tasks. ReLA
achieves translation performance comparable to several strong baselines, with
training and decoding speed similar to that of the vanilla attention. Our
analysis shows that ReLA delivers high sparsity rate and head diversity, and
the induced cross attention achieves better accuracy with respect to
source-target word alignment than recent sparsified softmax-based models.
Intriguingly, ReLA heads also learn to attend to nothing (i.e. 'switch off')
for some queries, which is not possible with sparsified softmax alternatives.
%0 Generic
%1 zhang2021sparse
%A Zhang, Biao
%A Titov, Ivan
%A Sennrich, Rico
%D 2021
%K 2021 attention deep-learning
%T Sparse Attention with Linear Units
%U http://arxiv.org/abs/2104.07012
%X Recently, it has been argued that encoder-decoder models can be made more
interpretable by replacing the softmax function in the attention with its
sparse variants. In this work, we introduce a novel, simple method for
achieving sparsity in attention: we replace the softmax activation with a ReLU,
and show that sparsity naturally emerges from such a formulation. Training
stability is achieved with layer normalization with either a specialized
initialization or an additional gating function. Our model, which we call
Rectified Linear Attention (ReLA), is easy to implement and more efficient than
previously proposed sparse attention mechanisms. We apply ReLA to the
Transformer and conduct experiments on five machine translation tasks. ReLA
achieves translation performance comparable to several strong baselines, with
training and decoding speed similar to that of the vanilla attention. Our
analysis shows that ReLA delivers high sparsity rate and head diversity, and
the induced cross attention achieves better accuracy with respect to
source-target word alignment than recent sparsified softmax-based models.
Intriguingly, ReLA heads also learn to attend to nothing (i.e. 'switch off')
for some queries, which is not possible with sparsified softmax alternatives.
@misc{zhang2021sparse,
abstract = {Recently, it has been argued that encoder-decoder models can be made more
interpretable by replacing the softmax function in the attention with its
sparse variants. In this work, we introduce a novel, simple method for
achieving sparsity in attention: we replace the softmax activation with a ReLU,
and show that sparsity naturally emerges from such a formulation. Training
stability is achieved with layer normalization with either a specialized
initialization or an additional gating function. Our model, which we call
Rectified Linear Attention (ReLA), is easy to implement and more efficient than
previously proposed sparse attention mechanisms. We apply ReLA to the
Transformer and conduct experiments on five machine translation tasks. ReLA
achieves translation performance comparable to several strong baselines, with
training and decoding speed similar to that of the vanilla attention. Our
analysis shows that ReLA delivers high sparsity rate and head diversity, and
the induced cross attention achieves better accuracy with respect to
source-target word alignment than recent sparsified softmax-based models.
Intriguingly, ReLA heads also learn to attend to nothing (i.e. 'switch off')
for some queries, which is not possible with sparsified softmax alternatives.},
added-at = {2021-04-21T16:19:15.000+0200},
author = {Zhang, Biao and Titov, Ivan and Sennrich, Rico},
biburl = {https://www.bibsonomy.org/bibtex/29f7ac3647ae7fe00654cc71654467300/analyst},
description = {[2104.07012] Sparse Attention with Linear Units},
interhash = {3aec53479afcf5749a98ce61e048f8b4},
intrahash = {9f7ac3647ae7fe00654cc71654467300},
keywords = {2021 attention deep-learning},
note = {cite arxiv:2104.07012},
timestamp = {2021-04-21T16:19:15.000+0200},
title = {Sparse Attention with Linear Units},
url = {http://arxiv.org/abs/2104.07012},
year = 2021
}