Modern deep learning models, especially in natural language processing, usually pursue better accuracy by increasing the number of model parameters (often combined with training on larger datasets), which comes at a huge computational cost. To improve computational efficiency, the authors of this paper split the fully connected layers in their Transformer blocks into sets of many alternatives (experts), with only one expert chosen for each input token in each layer. This makes it possible to grow the model as desired (within the available memory constraints) by increasing the number of experts while keeping the computational cost per input token constant.
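The one-expert-per-token idea in the summary above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `top1_moe_layer`, the linear router, the ReLU experts, and the scaling of each output by the router probability are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def top1_moe_layer(tokens, router_w, w_in, w_out):
    """Hypothetical top-1 mixture-of-experts feed-forward layer.

    tokens:   (n_tokens, d_model)
    router_w: (d_model, n_experts)           assumed linear router
    w_in:     (n_experts, d_model, d_hidden) per-expert first layer
    w_out:    (n_experts, d_hidden, d_model) per-expert second layer
    """
    probs = softmax(tokens @ router_w)             # (n_tokens, n_experts)
    choice = probs.argmax(axis=-1)                 # one expert per token
    gate = probs[np.arange(len(tokens)), choice]   # probability of the chosen expert
    out = np.zeros_like(tokens)
    for e in range(router_w.shape[1]):
        idx = np.where(choice == e)[0]
        if idx.size == 0:
            continue                               # no token was routed to this expert
        hidden = np.maximum(tokens[idx] @ w_in[e], 0.0)   # ReLU (assumed)
        # Assumption: the output is scaled by the router probability.
        out[idx] = gate[idx, None] * (hidden @ w_out[e])
    return out, choice

rng = np.random.default_rng(0)
n_tokens, d_model, d_hidden, n_experts = 8, 4, 16, 3
tokens = rng.normal(size=(n_tokens, d_model))
router_w = rng.normal(size=(d_model, n_experts))
w_in = rng.normal(size=(n_experts, d_model, d_hidden))
w_out = rng.normal(size=(n_experts, d_hidden, d_model))
out, choice = top1_moe_layer(tokens, router_w, w_in, w_out)
```

Each token passes through exactly one expert's two matrices, so adding experts grows `w_in` and `w_out` (and the small router) without increasing the expert computation spent on any single token.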
The authors develop a neural attention layer for working with 3D point data and implement it in a Point Transformer network that shows new state-of-the-art results (in some cases by a significant margin) on a number of standard 3D benchmarks.
The author has developed a new “differentiable forest”-type neural network framework for predictions on tabular data. It bears some similarity to the recently proposed NODE architecture and employs squeeze-and-excitation “tree attention blocks” (TABs) to outperform gradient boosted decision trees (e.g. XGBoost, LightGBM, CatBoost) on a number of benchmarks.
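How the TABs are wired is not described in the summary above. As a rough sketch only, the generic squeeze-and-excitation pattern can be applied with each tree treated as a channel; the function name `tree_se_block`, the tensor shapes, and the reduction ratio below are assumptions, not details taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tree_se_block(tree_out, w_reduce, w_expand):
    """Hypothetical squeeze-and-excitation reweighting over trees.

    tree_out: (batch, n_trees, d_out)  per-tree outputs
    w_reduce: (n_trees, n_hidden)      bottleneck layer (assumed)
    w_expand: (n_hidden, n_trees)      expansion layer (assumed)
    """
    squeezed = tree_out.mean(axis=-1)                # squeeze: one scalar per tree
    hidden = np.maximum(squeezed @ w_reduce, 0.0)    # bottleneck with ReLU
    weights = sigmoid(hidden @ w_expand)             # excitation: per-tree weight in [0, 1]
    return tree_out * weights[:, :, None], weights   # rescale each tree's output

rng = np.random.default_rng(0)
batch, n_trees, d_out, reduction = 5, 8, 3, 4
tree_out = rng.normal(size=(batch, n_trees, d_out))
w_reduce = rng.normal(size=(n_trees, n_trees // reduction))
w_expand = rng.normal(size=(n_trees // reduction, n_trees))
reweighted, weights = tree_se_block(tree_out, w_reduce, w_expand)
```

The learned weights let the network emphasize or suppress individual trees per input, which is the sense in which such a block acts as attention over the forest.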
In this paper, the authors present a Transformer attention model whose complexity is linear in the sequence length, that is mathematically proven to be Turing complete (and thus as powerful as the original quadratic attention model), and that achieves new state-of-the-art results on many NLP tasks involving long sequences (e.g. question answering and summarization), as well as on genomics data.
This paper proposes an approximate way of calculating self-attention in Transformer architectures that has linear space and time complexity in the sequence length. The resulting performance on benchmark datasets is similar to that of the RoBERTa model, which is based on the original Transformer with its much less efficient quadratic attention.
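The summary above does not say which approximation is used. One well-known way to obtain linear cost, shown here purely as an assumed illustration, is to project the keys and values along the sequence axis down to a fixed length before computing attention. The names `low_rank_attention`, `proj_k`, `proj_v`, and `rank` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(q, k, v):
    # Standard attention: the score matrix is (seq_len, seq_len), quadratic in seq_len.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def low_rank_attention(q, k, v, proj_k, proj_v):
    """Assumed low-rank variant: proj_k and proj_v have shape (rank, seq_len)."""
    k_small = proj_k @ k                                # (rank, d_model)
    v_small = proj_v @ v                                # (rank, d_model)
    scores = q @ k_small.T / np.sqrt(q.shape[-1])       # (seq_len, rank), linear in seq_len
    return softmax(scores) @ v_small                    # (seq_len, d_model)

rng = np.random.default_rng(0)
seq_len, d_model, rank = 64, 8, 16
q = rng.normal(size=(seq_len, d_model))
k = rng.normal(size=(seq_len, d_model))
v = rng.normal(size=(seq_len, d_model))
# Random projections stand in for what would be learned parameters.
proj_k = rng.normal(size=(rank, seq_len)) / np.sqrt(seq_len)
proj_v = rng.normal(size=(rank, seq_len)) / np.sqrt(seq_len)
approx = low_rank_attention(q, k, v, proj_k, proj_v)
```

With `rank` held fixed, the score matrix grows as `seq_len * rank` rather than `seq_len * seq_len`. Setting both projections to the identity matrix recovers `full_attention` exactly, which is a convenient sanity check.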