What can we learn from this paper?
That a small tweak to the Adam algorithm may help improve its generalization ability.
Prerequisites (to better understand the paper, what should one be familiar with?)
- Neural networks
- Gradient descent optimization
First-order gradient descent optimization with backpropagation is the standard way of training deep neural networks. However, there are various algorithms for doing that. The two most popular optimizers in 2020 are the basic SGD (stochastic gradient descent), which may incorporate additional features such as momentum, and the even more widely used Adam (adaptive moment estimation), which computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients.
A good overview of stochastic gradient descent (written in 2014, before Adam was introduced) can be found in Geoffrey Hinton’s lectures at the University of Toronto. One particular insight given there is that in the vicinity of an error function minimum, where the surface of this function can be approximated as an elliptical bowl with respect to any pair of parameters, the direction of the steepest descent only points towards the minimum if the ellipse is a circle. When the ellipse is elongated, the gradient is actually larger in the shorter dimension (in which we want to move less) and smaller in the longer dimension (where a larger move is required). The learning rate is the proportionality coefficient between the current gradient and the amount of change applied to the optimized parameters. With a large learning rate, this discrepancy can make gradient descent unstable; with a small one, it requires longer training and extra attention to learning rate schedules.
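The effect of an elongated bowl can be illustrated with a minimal sketch of plain gradient descent on a two-parameter quadratic. The function f(x, y) = 0.5 * (a*x^2 + b*y^2), the curvatures a = 1 and b = 25, the starting point, and both learning rates are assumptions chosen for illustration; none of them come from the lectures or the paper.

```python
def gradient_descent(lr, steps=50, a=1.0, b=25.0, start=(10.0, 1.0)):
    """Plain gradient descent on f(x, y) = 0.5 * (a*x**2 + b*y**2).

    The bowl is elongated along x (low curvature a) and narrow
    along y (high curvature b). All values are illustrative.
    """
    x, y = start
    for _ in range(steps):
        gx, gy = a * x, b * y   # gradient components
        x -= lr * gx            # small gradient in the long dimension
        y -= lr * gy            # large gradient in the short dimension
    return x, y


# A learning rate just below the stability limit 2 / b = 0.08:
# y converges quickly, but x is still far from the minimum.
x_stable, y_stable = gradient_descent(lr=0.07)

# A learning rate above 2 / b: x keeps shrinking, but y blows up.
x_unstable, y_unstable = gradient_descent(lr=0.1)

print("lr=0.07:", x_stable, y_stable)
print("lr=0.10:", x_unstable, y_unstable)
```

In this toy setting, the narrow dimension caps the usable learning rate, and that cap is what makes progress along the long dimension slow.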
Basically, what is desirable during optimization is to make larger steps in the directions with consistent gradients, and smaller steps in the directions where the gradient is inconsistent. The Adam algorithm addresses this need by keeping track of the exponential moving average of each component of the gradient (the first moment) and of its square (the second moment). Instead of multiplying the learning rate by the gradient value (as in SGD), it multiplies the learning rate by the ratio of the first moment to the square root of the second moment. This should lead to much smaller steps in the directions in which the gradient has recently not been steady while keeping the steps larger in the directions of consistent gradient (since in this case the magnitude of the ratio of the first moment to the square root of the second moment will be close to 1).
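The behavior of this ratio can be checked with a small sketch for a single parameter. The function name `adam_direction` is hypothetical, and the hyperparameters (beta1 = 0.9, beta2 = 0.999, eps = 1e-8) are the commonly used Adam defaults, assumed here. The sketch also includes Adam's bias correction, which the text above does not discuss.

```python
import math


def adam_direction(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the factor that Adam multiplies the learning rate by
    after seeing the gradient sequence `grads` (one parameter)."""
    m = 0.0  # exponential moving average of the gradient (first moment)
    v = 0.0  # exponential moving average of the squared gradient (second moment)
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
    # Bias correction compensates for m and v being initialized at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps)


steady = adam_direction([0.5] * 100)        # consistent gradient
noisy = adam_direction([0.5, -0.5] * 50)    # sign flips at every step

print("consistent gradient:", steady)  # magnitude close to 1
print("oscillating gradient:", noisy)  # magnitude much smaller than 1
```

With the consistent sequence the factor stays near 1, so the step is roughly the learning rate itself. With the oscillating sequence the first moment nearly cancels out while the second moment does not, so the step shrinks.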
One disadvantage of Adam, which may lead to its inferior generalization performance, is that in situations with a consistent gradient the step sizes do not depend on the value of the gradient, while for an ideal optimizer the steps should be larger for a large consistent gradient than for a small consistent one. In the picture (in which I find the authors’ current step size comments to be confusing), this means that the step sizes in region 3 should ideally be larger than the step sizes in region 1, which is not the case for Adam.
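To see why the step size does not depend on the gradient value in this situation, consider the idealized case of a perfectly constant gradient g for a single parameter. The notation below (learning rate \alpha, decay rates \beta_1 and \beta_2, bias-corrected moments \hat{m}_t and \hat{v}_t, small constant \epsilon) follows the usual presentation of Adam and is an assumption of this sketch rather than something taken from the AdaBelief paper.

```latex
% Constant gradient g_t = g \neq 0, with m_0 = v_0 = 0:
m_t = (1-\beta_1)\sum_{k=1}^{t}\beta_1^{\,t-k}\, g = g\,(1-\beta_1^{\,t}),
\qquad
\hat{m}_t = \frac{m_t}{1-\beta_1^{\,t}} = g
\\[4pt]
v_t = (1-\beta_2)\sum_{k=1}^{t}\beta_2^{\,t-k}\, g^2 = g^2\,(1-\beta_2^{\,t}),
\qquad
\hat{v}_t = \frac{v_t}{1-\beta_2^{\,t}} = g^2
\\[4pt]
\Delta\theta_t = -\alpha\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}
\approx -\alpha\,\frac{g}{|g|} = -\alpha\,\operatorname{sign}(g)
```

The magnitude of g cancels, so a large consistent gradient and a small consistent gradient both produce a step of approximately \alpha.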
The AdaBelief optimizer aims to correct this issue by replacing the second moment of the gradient in Adam with the second moment of the difference between the current gradient and its exponential moving average, that is, the exponential moving average of the squared difference. According to the authors, when the current value of the gradient deviates strongly from its moving average, we have weak “belief” in the current value and make the steps smaller, while when the current value is consistent with the recent average, we have strong “belief” and increase the step size.
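The difference between the two denominators can be sketched for a single parameter as follows. The function name `normalized_step` and its `centered` flag are hypothetical, and the hyperparameters (beta1 = 0.9, beta2 = 0.999, eps = 1e-8) are assumed values, not necessarily the ones used by the authors. This is a simplified illustration of the idea, not the authors' reference implementation.

```python
import math


def normalized_step(grads, centered, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return step size divided by learning rate after seeing `grads`.

    centered=False: Adam-style denominator, EMA of g**2.
    centered=True:  AdaBelief-style denominator, EMA of (g - m)**2.
    """
    m = 0.0  # exponential moving average of the gradient
    d = 0.0  # exponential moving average used in the denominator
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        if centered:
            # Squared deviation of the gradient from its moving average.
            d = beta2 * d + (1 - beta2) * (g - m) ** 2 + eps
        else:
            d = beta2 * d + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction, as in Adam
    d_hat = d / (1 - beta2 ** t)
    return m_hat / (math.sqrt(d_hat) + eps)


consistent = [1.0] * 200
oscillating = [1.0, -1.0] * 100

adam_consistent = normalized_step(consistent, centered=False)
belief_consistent = normalized_step(consistent, centered=True)
adam_oscillating = normalized_step(oscillating, centered=False)
belief_oscillating = normalized_step(oscillating, centered=True)

print("consistent:  Adam", adam_consistent, "AdaBelief", belief_consistent)
print("oscillating: Adam", adam_oscillating, "AdaBelief", belief_oscillating)
```

In this toy setting the centered denominator shrinks once the gradient agrees with its moving average, so the AdaBelief-style step grows beyond the Adam-style one. For the oscillating gradient, both variants keep the step small.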
The authors provide some mathematical bounds on the convergence of AdaBelief and analyze its performance on common datasets such as CIFAR-10, CIFAR-100, and ImageNet using standard ResNet, DenseNet, and VGG network configurations. The results suggest that the new optimizer generalizes about as well as SGD and better than Adam. The authors look at other networks as well, such as LSTMs on the Penn Treebank dataset and generative adversarial networks such as WGAN. In all cases, the results show fast and stable convergence combined with good generalization.
With the current state of deep learning optimizers, it may not be easy for new arrivals to capture significant attention. For most practical tasks, Adam seems to be good enough, while those wanting superior accuracy tend to use SGD at the expense of longer training. Several recent promising optimizers, such as AdaBound and RAdam, have only been adopted by a small share of practitioners. Thus, it remains to be seen whether AdaBelief will become widely used by the deep learning community, but it seems like an interesting and promising addition to the existing range of deep neural network optimizers.