As an improvement over existing Dropout regularization variants for deep neural networks (e.g. regular Dropout, SpatialDropout, DropBlock), which impose a randomized structure with certain fixed, hand-chosen parameters, the authors develop a reinforcement learning approach for finding better dropout patterns for various network architectures.
What can we learn from this paper?
That there is a way to automatically learn more efficient Dropout techniques for popular neural network architectures such as ResNets or Transformers.
Prerequisites (to better understand the paper, what should one be familiar with?)
- Deep neural networks
- Dropout regularization
Since modern neural networks are nearly always overparameterized relative to the complexity of the input-to-output mapping they are trying to learn, regularization during training is essential to prevent overfitting. Although many popular regularization techniques exist (including L1 and L2 penalties, data augmentation, and early stopping), Dropout has been one of the most common regularization approaches for deep neural networks since it was first introduced in the early 2010s.
The idea of standard dropout is to randomly set activations in a neural network layer to zero during training, independently with probability p for each neuron, thus ensuring that the network does not rely too heavily on any individual neuron for its predictions. During inference, no neurons are dropped; instead, activations are scaled by multiplying them by 1 − p to compensate for the larger number of non-zero activations compared to training (equivalently, in the widely used "inverted dropout" formulation, activations are divided by 1 − p during training, and inference then requires no scaling at all). Regularization is strongest at p = 0.5: applying dropout is approximately equivalent to adding a penalty term to the network's loss proportional to p(1 − p), and this term is maximized at p = 0.5.
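The mechanics described above can be sketched in a few lines of NumPy. This sketch uses the inverted-dropout formulation, where the rescaling by 1/(1 − p) happens during training so inference needs no adjustment; the function name and signature are illustrative, not from the paper:

```python
import numpy as np

def dropout(x, p, training=True, rng=None):
    """Inverted dropout: zero each activation with probability p during
    training and rescale the survivors by 1/(1 - p), so that the expected
    activation is unchanged and inference needs no extra scaling."""
    if not training or p == 0.0:
        return x
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p  # keep each unit with probability 1 - p
    return x * mask / (1.0 - p)
```

At p = 0.5 roughly half the activations are zeroed and the survivors are doubled, so the mean activation is preserved while the per-unit noise variance, proportional to p(1 − p), is at its maximum.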
Recently, a number of Dropout techniques have been suggested that, instead of randomly choosing individual neurons to drop in a given layer, drop them according to a structure chosen to suit the configuration of the network. In particular, SpatialDropout, typically used in early convolutional layers, drops entire feature maps for randomly chosen channels instead of individual neurons. This is done because, if nearby neurons are strongly correlated (which is usually the case in early layers, where neurons correspond to nearby pixels of the input image), standard Dropout effectively just decreases the learning rate without providing a strong regularization benefit. Another approach, DropBlock, addresses the same problem by dropping contiguous square regions of feature maps rather than whole maps as in SpatialDropout.
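The difference from standard dropout can be illustrated by drawing one Bernoulli variable per channel rather than per unit, so an entire feature map is kept or dropped together (again, the function name and signature are illustrative):

```python
import numpy as np

def spatial_dropout(x, p, training=True, rng=None):
    """SpatialDropout for an (N, C, H, W) activation tensor: drop whole
    feature maps (channels) with probability p instead of single units,
    then rescale survivors by 1/(1 - p) as in inverted dropout."""
    if not training or p == 0.0:
        return x
    rng = rng or np.random.default_rng()
    n, c = x.shape[:2]
    # One Bernoulli draw per (sample, channel), broadcast over H and W.
    mask = (rng.random((n, c, 1, 1)) >= p).astype(x.dtype)
    return x * mask / (1.0 - p)
```

Because the mask is constant across each H × W map, correlated neighboring activations are removed together instead of being thinned out unit by unit.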
Building on these earlier works, the AutoDropout technique developed in the present paper learns the dropout pattern at every layer and channel of the target network: a reinforcement learning (RL) controller generates candidate patterns, the target network is trained with them, and the resulting performance on a validation set serves as the reward directing the search towards better patterns.
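The feedback loop can be reduced to a toy sketch. In the paper the controller is itself a trained model updated with a policy-gradient method; the stand-in below replaces it with plain random search over candidates, purely to show the propose → train → evaluate cycle (all names here are hypothetical, not the paper's API):

```python
def pattern_search(propose_pattern, train_and_validate, steps=100):
    """Toy stand-in for the AutoDropout search loop: repeatedly sample a
    candidate dropout pattern, train the target network with it, and use
    the validation score as the reward that guides the search. Here the
    'controller' is simple random search rather than the paper's RL policy."""
    best_pattern, best_score = None, float("-inf")
    for _ in range(steps):
        pattern = propose_pattern()          # controller samples a candidate
        score = train_and_validate(pattern)  # reward = validation performance
        if score > best_score:
            best_pattern, best_score = pattern, score
    return best_pattern, best_score
```

In the real method, `train_and_validate` is the expensive step (a full training run of the target network), which is what makes the overall search costly.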
Based on the patterns from many existing dropout approaches, the authors have designed novel search spaces of structured dropout patterns for different network architectures that the RL algorithm can use to find the most performant patterns for each particular architecture. An example pattern for convolutional networks (obtained by applying rotation and shearing transformations to a base rectangular tile pattern) is shown in the bottom part of the picture taken from the paper.
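As a minimal illustration of what an element of such a search space might look like, the sketch below builds a periodic mask of dropped rectangular tiles with an optional per-row shear; the exact parameterization is made up for this example and is not the paper's:

```python
import numpy as np

def tiled_mask(h, w, tile_h, tile_w, stride_h, stride_w, shear=0):
    """Illustrative structured dropout pattern: a periodic grid of dropped
    rectangular tiles over an h x w feature map, with an optional horizontal
    shear applied to each successive row of tiles. Returns a {0, 1} mask
    where 0 marks dropped units."""
    mask = np.ones((h, w), dtype=np.int8)
    for top in range(0, h, stride_h):
        offset = (top // stride_h) * shear  # shift each row of tiles sideways
        for left in range(0, w, stride_w):
            l = (left + offset) % w
            mask[top:top + tile_h, l:l + tile_w] = 0
    return mask
```

The controller's job is then to pick parameters such as tile size, stride, and geometric transformations so that the resulting pattern regularizes the particular architecture well.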
As the authors note, AutoDropout can be considered a data augmentation method applied to the network's hidden states. Unlike traditional data augmentation methods, however, AutoDropout is not domain-specific: the same general approach can be used, for example, for Transformers in NLP and for convolutional networks in computer vision, only with different pattern search spaces.
Applying AutoDropout noticeably improves the performance of many popular network architectures on common datasets. For example, the top-1 accuracy of ResNet-50 on ImageNet increased from 76.5% to 78.7% with the new technique, and for EfficientNet-B7 on the same dataset it increased from 84.1% to 84.7%. On the NLP side, the perplexity of Transformer-XL on the Penn Treebank dataset was reduced from 56.0 to 54.9.
As the authors note, the computational cost of running AutoDropout is high, so the intended workflow is to find the best-performing patterns in advance and then reuse them in existing deep learning pipelines, similarly to how AutoAugment policies are reused.
The authors discuss a number of other topics of interest, such as ways to improve parallelism when training AutoDropout, and the characteristics of some of the patterns they have found. Overall, this paper seems to be an important new step towards automating Dropout regularization.