Modern deep learning models, especially in natural language processing, usually strive for better accuracy by increasing the number of model parameters (often combined with training on larger datasets), which comes at a huge computational cost. To achieve better computational efficiency, the authors of this paper split the fully connected layers in their Transformer blocks into sets of many alternative sub-networks (experts), of which only one is chosen for each input in each layer. This makes it possible to grow the model as desired (within the available memory constraints) by increasing the number of experts while keeping the computational cost per input token constant.
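The routing idea can be illustrated with a minimal pure-Python sketch of top-1 ("switch") routing; all layer sizes and weights below are illustrative, not taken from the paper:

```python
import math
import random

random.seed(0)

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

D, H, E = 4, 8, 3  # hidden size, expert FFN size, number of experts (illustrative)

# Router: one weight vector per expert; experts: tiny two-layer ReLU FFNs.
router = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(E)]
experts = [
    (
        [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(H)],  # W1: D -> H
        [[random.gauss(0, 0.1) for _ in range(H)] for _ in range(D)],  # W2: H -> D
    )
    for _ in range(E)
]

def switch_ffn(x):
    """Route the token to its single top-scoring expert (top-1 routing).

    Per-token compute stays that of ONE expert, no matter how many
    experts (and thus parameters) the layer has.
    """
    probs = softmax(matvec(router, x))
    k = max(range(E), key=lambda i: probs[i])
    w1, w2 = experts[k]
    h = [max(0.0, v) for v in matvec(w1, x)]   # ReLU
    y = matvec(w2, h)
    # Scale by the gate probability so the router itself is trainable.
    return [probs[k] * v for v in y], k

y, chosen = switch_ffn([0.5, -1.0, 0.2, 0.9])
```

Adding more experts grows the parameter count (and memory use) linearly, while the work done for any single token stays fixed at one router evaluation plus one expert FFN.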
As an improvement over existing Dropout regularization variants for deep neural networks (e.g. regular Dropout, SpatialDropout, DropBlock), whose dropout patterns are random but follow a fixed, hand-designed structure, the authors develop a reinforcement learning approach that searches for better Dropout patterns for various network architectures.
The authors develop a neural attention layer for 3D point-cloud data and use it in a Point Transformer network that achieves new state-of-the-art results (in some cases by a significant margin) on a number of standard 3D benchmarks.
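The core operation can be sketched as self-attention restricted to each point's local neighborhood. The sketch below is heavily simplified: the actual Point Transformer uses learned vector attention with an MLP-based position encoding, which this sketch replaces with fixed dot products and a squared-distance term for clarity; the toy points and features are made up:

```python
import math

def point_attention(points, feats, k=3):
    """Simplified self-attention over each point's k nearest neighbors.

    For each point, attention scores combine feature similarity with a
    relative-position term; the output feature is the attention-weighted
    sum of neighbor features.
    """
    n = len(points)
    out = []
    for i in range(n):
        # k nearest neighbors by Euclidean distance (the point itself included).
        order = sorted(range(n), key=lambda j: sum(
            (a - b) ** 2 for a, b in zip(points[i], points[j])))
        nbrs = order[:k]
        # Score: feature dot product minus squared distance (position term).
        scores = []
        for j in nbrs:
            sim = sum(a * b for a, b in zip(feats[i], feats[j]))
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            scores.append(sim - d2)
        m = max(scores)
        ws = [math.exp(s - m) for s in scores]
        z = sum(ws)
        ws = [w / z for w in ws]
        # Weighted sum of neighbor features.
        out.append([sum(w * feats[j][d] for w, j in zip(ws, nbrs))
                    for d in range(len(feats[i]))])
    return out

pts = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0)]
fts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_feats = point_attention(pts, fts, k=2)
```

Restricting attention to local neighborhoods keeps the cost linear in the number of points, which is what makes the approach practical for large point clouds.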
In this paper, the author shows that neural networks trained by first-order gradient descent with a small learning rate are approximately equivalent to kernel machines: they effectively memorize the training points and make predictions via a similarity kernel over them, much as SVMs and other kernel methods do. This insight should lead to a better general understanding of how deep neural networks operate and, hopefully, help improve future algorithms.
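The general form of the result can be stated compactly; the equations below are reproduced in simplified notation (the coefficients a_i and offset b depend on the loss derivatives along training, and the kernel is the paper's "path kernel" integrated over the gradient-descent trajectory w(t)), so they should be read as a hedged sketch rather than the paper's exact statement:

```latex
y \;=\; f_w(x) \;\approx\; \sum_{i} a_i \, K(x, x_i) \;+\; b,
\qquad
K(x, x') \;=\; \int_{0}^{T} \nabla_w f_{w(t)}(x) \cdot \nabla_w f_{w(t)}(x') \, \mathrm{d}t
```

Here the sum runs over the training points x_i, so the trained network's prediction is a weighted similarity comparison against the training set, which is exactly the kernel-machine form.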
Inspired by the widespread use of the standard MNIST as a playground dataset for deep learning, the author has developed a new MNIST-1D dataset that is even smaller (just a one-dimensional sequence of 40 numbers per sample) but harder to classify, shows clearer performance differences across network architectures, and is more conducive to exploring various interesting topics such as, for example, analyzing “lottery tickets” and the double descent phenomenon.
In this paper, the authors examine in detail the phenomenon of gradient starvation, originally introduced by the same research group in 2018, for neural networks trained with the common cross-entropy loss. Gradient starvation occurs when the presence of easy-to-learn features in a dataset prevents the learning of other, equally informative features, which can leave the trained models non-robust because they rely on only those few features. The authors propose a new Spectral Decoupling regularization method to combat this problem.
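The proposed regularizer amounts to penalizing the magnitude of the network's raw outputs rather than its weights. A minimal single-example sketch, assuming the loss takes the form cross-entropy plus an L2 penalty on the logits with an illustrative coefficient `lam`:

```python
import math

def spectral_decoupling_loss(logits, target, lam=0.1):
    """Cross-entropy plus an L2 penalty on the network's raw outputs.

    Penalizing the logits themselves (instead of the weights, as in
    ordinary weight decay) discourages the model from pushing its
    confidence arbitrarily high using only the easiest features.
    `lam` is an illustrative hyperparameter, not a value from the paper.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    cross_entropy = log_z - logits[target]
    penalty = 0.5 * lam * sum(v * v for v in logits)
    return cross_entropy + penalty

loss = spectral_decoupling_loss([2.0, -1.0, 0.5], target=0, lam=0.1)
```

With `lam=0` this reduces to plain cross-entropy; increasing `lam` trades a little training-loss fit for outputs of smaller magnitude, which is what discourages over-reliance on a single dominant feature.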