This paper describes a new training approach for Transformer architectures used in language modeling. The authors demonstrate that their technique substantially improves training efficiency and yields better results on standard benchmarks (GLUE, SQuAD) than other state-of-the-art NLP models of similar size.
This paper proposes a new algorithm for training deep neural networks efficiently without a GPU.
This paper examines the theoretical reasons why batch normalization helps in deep residual networks and proposes a simpler alternative.
This paper presents a block-based deep neural architecture for univariate time series point forecasting, similar in philosophy to the very deep models (e.g. ResNet) used in more established deep learning applications such as image recognition. The authors also demonstrate how their approach can be used to build interpretable predictive models.
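As a rough illustration only (the summary above does not spell out the architecture), the sketch below shows one plausible reading of a "block-based, ResNet-like" forecaster: each block fits part of the lookback window, the portion it explains is subtracted from the residual passed to the next block, and the blocks' partial forecasts are summed. All class names, layer sizes, and the lookback/horizon values are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One fully connected block: maps a lookback window to a backcast
    (its fit of the input) and a partial forecast."""
    def __init__(self, lookback, horizon, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(lookback, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.backcast_head = nn.Linear(hidden, lookback)
        self.forecast_head = nn.Linear(hidden, horizon)

    def forward(self, x):
        h = self.body(x)
        return self.backcast_head(h), self.forecast_head(h)

class BlockStack(nn.Module):
    """Stack of blocks. Each block fits the residual left by the previous
    blocks; the partial forecasts are summed into the final prediction."""
    def __init__(self, lookback, horizon, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            [Block(lookback, horizon) for _ in range(n_blocks)]
        )

    def forward(self, x):
        residual, forecast = x, 0.0
        for block in self.blocks:
            backcast, partial = block(residual)
            residual = residual - backcast   # remove what this block explained
            forecast = forecast + partial    # accumulate forecast contributions
        return forecast

# Example: forecast 12 future points from a 48-point lookback window.
model = BlockStack(lookback=48, horizon=12)
y_hat = model(torch.randn(8, 48))  # batch of 8 windows -> output shape (8, 12)
```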
This paper presents a new neural network activation function and shows, across a number of experiments, that it often improves the accuracy of deep networks.
A great review of many state-of-the-art tricks for improving the performance of deep convolutional networks (e.g. ResNet), combined with concrete implementation details, source code, and performance results. A must-read for Kaggle competitors and anyone who wants to squeeze maximum performance out of computer vision models.