Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Review of paper by William Fedus, Barret Zoph, and Noam Shazeer, Google Brain, 2021.

Modern deep learning models, especially in natural language processing, usually strive for better accuracy by increasing the number of model parameters (often combined with training on larger datasets), which comes at a huge computational cost. In this paper, to achieve better computational efficiency, the authors replace each fully connected (feed-forward) layer in their Transformer blocks with a set of many alternative layers (experts), of which only one is selected for each input token at every layer. This makes it possible to grow the model as desired (within the available memory constraints) by increasing the number of experts while keeping the computational cost per input token constant.
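To make the routing idea concrete, here is a minimal sketch of a Switch-style feed-forward layer with top-1 expert selection, assuming PyTorch; the class name, layer sizes, and the simple per-expert loop are illustrative and omit the load-balancing loss and expert-capacity limits used in the paper.

```python
# Illustrative sketch of Switch-style top-1 expert routing (not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)   # routing logits per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)        # routing probabilities
        gate, expert_idx = probs.max(dim=-1)             # choose the single best expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = expert_idx == i                        # tokens routed to expert i
            if sel.any():
                out[sel] = gate[sel].unsqueeze(-1) * expert(x[sel])
        return out   # per-token compute stays constant no matter how many experts exist
```

Adding experts grows the parameter count (and memory footprint) but not the FLOPs per token, which is exactly the scaling knob the paper exploits.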


Big Bird: Transformers for Longer Sequences

Review of paper by Manzil Zaheer, Guru Guruganesh, Avinava Dubey et al., Google Research, 2020

In this paper, the authors present a sparse Transformer attention mechanism that combines sliding-window, global, and random attention and whose cost grows only linearly with the sequence length. The resulting model is mathematically shown to be as expressive as the original quadratic attention (in particular, it remains Turing complete) and achieves new state-of-the-art results on many NLP tasks involving long sequences (e.g. question answering and summarization), as well as on genomics data.
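As a rough illustration of the sparsity pattern (not the authors' block-sparse implementation), the sketch below builds a boolean attention mask from the three components; the window size and the numbers of global and random positions are hypothetical.

```python
# Toy BigBird-style attention mask: sliding window + global tokens + random links.
import torch

def bigbird_mask(seq_len, window=3, num_global=2, num_random=2, seed=0):
    g = torch.Generator().manual_seed(seed)
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    # sliding window: each token attends to its local neighbourhood
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True
    # global tokens attend to everything and are attended to by everything
    mask[:num_global, :] = True
    mask[:, :num_global] = True
    # each token additionally attends to a few random positions
    for i in range(seq_len):
        mask[i, torch.randint(0, seq_len, (num_random,), generator=g)] = True
    return mask   # True = attention allowed; O(1) entries per row, so O(seq_len) overall
```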


Linformer: Self-Attention with Linear Complexity

Review of paper by Sinong Wang, Belinda Z. Li, Madian Khabsa et al., Facebook AI Research, 2020

This paper suggests an approximate way of computing self-attention in Transformer architectures that has linear space and time complexity in the sequence length: keys and values are projected to a fixed, much smaller number of positions along the sequence axis, which amounts to a low-rank approximation of the attention matrix. The resulting performance on benchmark datasets is similar to that of RoBERTa, which is built on the original Transformer with its much less efficient quadratic attention.
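The single-head sketch below (assuming PyTorch, with illustrative dimensions) shows the core trick: learned projections compress the keys and values from n positions down to a small fixed number before the usual attention product.

```python
# Minimal single-head Linformer-style attention sketch (not the released implementation).
import math
import torch
import torch.nn as nn

class LinformerAttention(nn.Module):
    def __init__(self, seq_len=1024, d_head=64, proj_dim=128):
        super().__init__()
        # E and F project keys and values along the sequence axis: n -> proj_dim
        self.proj_k = nn.Linear(seq_len, proj_dim, bias=False)
        self.proj_v = nn.Linear(seq_len, proj_dim, bias=False)
        self.scale = 1.0 / math.sqrt(d_head)

    def forward(self, q, k, v):                                 # each: (batch, n, d_head)
        k_low = self.proj_k(k.transpose(1, 2)).transpose(1, 2)  # (batch, proj_dim, d_head)
        v_low = self.proj_v(v.transpose(1, 2)).transpose(1, 2)  # (batch, proj_dim, d_head)
        scores = q @ k_low.transpose(1, 2) * self.scale         # (batch, n, proj_dim), not (batch, n, n)
        return torch.softmax(scores, dim=-1) @ v_low            # O(n * proj_dim) time and memory
```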


Synthesizer: Rethinking Self-Attention for Transformer Models

Review of paper by Yi Tay, Dara Bahri, Donald Metzler et al., Google Research, 2020

Contrary to the common consensus that query-key dot-product attention is largely responsible for the superior performance of Transformer models on various NLP tasks, this paper suggests that replacing the attention weight matrix with synthesized weights, generated from each token alone or even randomly initialized, is sufficient to achieve comparable results with better efficiency.
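Here is a minimal sketch of the paper's Random Synthesizer variant, assuming PyTorch; names and sizes are illustrative, and the dense (per-token learned) variant is omitted.

```python
# Toy Random Synthesizer head: the "attention" matrix is a learned parameter,
# completely independent of the input tokens (no query-key dot products).
import torch
import torch.nn as nn

class RandomSynthesizer(nn.Module):
    def __init__(self, seq_len=128, d_model=64):
        super().__init__()
        self.attn = nn.Parameter(torch.randn(seq_len, seq_len))  # can also be frozen (fixed random)
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x):                              # x: (batch, seq_len, d_model)
        weights = torch.softmax(self.attn, dim=-1)     # synthetic attention weights
        return weights @ self.value(x)                 # mix value vectors with those weights
```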


ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Review of paper by Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning, Stanford University and Google Brain, 2020

This paper describes a new pre-training approach, replaced token detection, for Transformer architectures used in language modeling: instead of reconstructing masked tokens, the model is trained as a discriminator that predicts, for every input token, whether it was replaced by a plausible alternative sampled from a small generator network. The authors demonstrate that this technique results in greatly improved training efficiency and better performance on common benchmark datasets (GLUE, SQuAD) compared to other state-of-the-art NLP models of similar size.
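The schematic training step below sketches the replaced-token-detection objective, assuming PyTorch; the generator and discriminator arguments stand in for small Transformer encoders, and the function is illustrative rather than the released code.

```python
# Schematic ELECTRA-style training step: a small generator fills in masked tokens,
# and the discriminator classifies every token as original vs. replaced.
import torch
import torch.nn.functional as F

def electra_step(generator, discriminator, tokens, masked_tokens, mask):
    # tokens: original ids (batch, seq_len); masked_tokens: same, but with [MASK] ids
    # at the positions where mask is True
    gen_logits = generator(masked_tokens)                            # (batch, seq_len, vocab)
    with torch.no_grad():
        sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = torch.where(mask, sampled, tokens)                   # plausible replacements
    is_replaced = (corrupted != tokens).float()                      # per-token labels
    disc_logits = discriminator(corrupted)                           # (batch, seq_len), one logit per token
    gen_loss = F.cross_entropy(gen_logits[mask], tokens[mask])       # masked-LM loss for the generator
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)
    return gen_loss + 50.0 * disc_loss                               # discriminator weight as in the paper
```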

