Contrary to the common consensus that self-attention is largely responsible for the superior performance of Transformer models on various NLP tasks, this paper suggests that replacing the query-key dot-product attention weights with random or simply synthesized matrices is enough to achieve comparable results with better efficiency.
What can we learn from this paper?
That self-attention layers in Transformer models may not be as important as most people think.
Prerequisites (to understand the paper, what does one need to be familiar with?)
- Attention layers
- Transformer networks
Discussion
Since they were introduced in 2017, Transformers have largely replaced the previously dominant recurrent and autoregressive models in state-of-the-art NLP research. However, there is still a limited understanding of how they achieve their superior performance, and their relative inefficiency is well known, especially the cost of the self-attention layers, where computationally expensive query-key dot products are calculated for every pair of tokens and the resulting weights are applied to the value matrix. The recent Reformer paper replaces dot-product attention with an approximation based on locality-sensitive hashing, but this paper suggests that even that may be unnecessary and that self-attention layers are not especially critical to the overall performance of Transformer networks.
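To make concrete what is being replaced, here is a minimal sketch of standard scaled dot-product self-attention; the function and tensor names are illustrative, not taken from the paper or any particular library.

```python
# Minimal sketch of standard scaled dot-product self-attention.
import torch
import torch.nn.functional as F

def dot_product_self_attention(x, w_q, w_k, w_v):
    """x: (batch, seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Token-to-token interaction: every query is compared with every key,
    # which is the O(L^2) cost the paper argues may be unnecessary.
    scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v
```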
To prove their point, the authors replace the dot-product attention computation with one of several simple alternatives that involve no token-to-token interaction: a dense synthesizer, a two-layer feed-forward network that predicts the attention weights from each token of the sequence independently, or a random synthesizer that uses a randomly initialized matrix (either fixed or globally trainable) as the attention weights. The authors did not expect the random variant to work at all, yet it turned out to be a strong baseline.
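A minimal sketch of these two synthesizer variants, assuming a PyTorch-style module interface; the class names, single-head formulation, and dimensions are illustrative simplifications rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseSynthesizerAttention(nn.Module):
    """Predicts the (seq_len x seq_len) attention weights from each token
    alone via a 2-layer feed-forward network -- no query-key dot products."""
    def __init__(self, d_model, max_len):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, max_len),   # one attention logit per target position
        )
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        logits = self.proj(x)[:, :, :seq_len]           # (batch, L, L)
        weights = F.softmax(logits, dim=-1)
        return weights @ self.value(x)                  # (batch, L, d_model)

class RandomSynthesizerAttention(nn.Module):
    """Attention weights come from a single random matrix shared across
    examples; it can be kept fixed or trained like any other parameter."""
    def __init__(self, d_model, max_len, trainable=True):
        super().__init__()
        self.logits = nn.Parameter(torch.randn(max_len, max_len),
                                   requires_grad=trainable)
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        weights = F.softmax(self.logits[:seq_len, :seq_len], dim=-1)
        return weights @ self.value(x)                  # (batch, L, d_model)
```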
The synthesized attention models were tested on a number of NLP tasks involving popular datasets: language modeling on LM1B, abstractive summarization on CNN/Dailymail, dialogue generation on PersonaChat, and multi-task language understanding on the GLUE and SuperGLUE benchmarks. On almost all evaluated tasks, synthesized attention came reasonably close to dot-product self-attention, and even improved upon it for dialogue generation, while reducing computational complexity and parameter costs by about 10% compared to the standard Transformer.
One important caveat is that the authors' conclusions hold only for self-attention, not for cross-sentence attention. This became evident on the multi-task language understanding benchmarks, where sentence pairs are concatenated so that self-attention effectively acts as a form of cross-attention; as a result, synthesized attention performed considerably worse on GLUE and SuperGLUE. However, complementing the base text-to-text T5 Transformer model with trained synthetic attention boosted performance compared to the vanilla T5.
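A rough sketch of what such a combination could look like, with the synthetic logits simply added to the dot-product logits before the softmax; the unweighted sum and the function signature are assumptions for illustration, not the exact mixture used in the paper.

```python
import torch
import torch.nn.functional as F

def mixed_attention(x, w_q, w_k, w_v, random_logits):
    """random_logits: trainable (max_len, max_len) synthetic attention logits."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    dot_logits = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
    seq_len = x.size(1)
    # Combine synthetic and dot-product attention before normalizing.
    logits = dot_logits + random_logits[:seq_len, :seq_len]
    return F.softmax(logits, dim=-1) @ v
```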
Overall, this looks like a very important paper for gaining deeper insight into the inner workings of Transformer models.