Abstract
Recent trends of incorporating attention mechanisms in vision have led
researchers to reconsider the supremacy of convolutional layers as a primary
building block. Beyond helping CNNs to handle long-range dependencies,
Ramachandran et al. (2019) showed that attention can completely replace
convolution and achieve state-of-the-art performance on vision tasks. This
raises the question: do learned attention layers operate similarly to
convolutional layers? This work provides evidence that attention layers can
perform convolution and, indeed, they often learn to do so in practice.
Specifically, we prove that a multi-head self-attention layer with a sufficient
number of heads is at least as expressive as any convolutional layer. Our
numerical experiments then show that self-attention layers attend to pixel-grid
patterns similarly to CNN layers, corroborating our analysis. Our code is
publicly available.
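
To make the expressivity claim concrete, the following is a minimal, hypothetical sketch (in PyTorch; not the authors' released code) of the constructive idea behind the proof: a K x K convolution can be emulated by K^2 attention heads, where each head hard-attends to exactly one relative pixel shift and the per-head output projection applies the matching slice of the convolution kernel. The function name and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def conv_via_shift_heads(x, weight):
    """Emulate a K x K convolution with one 'head' per relative shift.

    x:      (H, W, C_in) feature map
    weight: (C_out, C_in, K, K) convolution kernel
    Each (dy, dx) loop iteration plays the role of one attention head whose
    attention matrix is the permutation selecting pixel (i+dy, j+dx); its
    output projection is the kernel slice weight[:, :, dy, dx].
    """
    H, W, C_in = x.shape
    C_out, _, K, _ = weight.shape
    pad = K // 2
    # Zero-pad so every shift is defined at the image border.
    xp = torch.zeros(H + 2 * pad, W + 2 * pad, C_in)
    xp[pad:H + pad, pad:W + pad] = x
    out = torch.zeros(H, W, C_out)
    for dy in range(K):
        for dx in range(K):
            # Value retrieved by the head attending to shift (dy, dx).
            shifted = xp[dy:dy + H, dx:dx + W]
            # Per-head output projection: apply the matching kernel slice.
            out += shifted @ weight[:, :, dy, dx].T
    return out

# Sanity check against a standard convolution (illustrative sizes).
x = torch.randn(8, 8, 3)
w = torch.randn(5, 3, 3, 3)
ref = F.conv2d(x.permute(2, 0, 1).unsqueeze(0), w, padding=1)
ref = ref[0].permute(1, 2, 0)
assert torch.allclose(conv_via_shift_heads(x, w), ref, atol=1e-5)
```

In the paper's setting, the hard attention to a fixed shift is realized by a quadratic relative positional encoding rather than an explicit loop; the sketch above only verifies that summing shift-selected values through per-shift projections reproduces the convolution output.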