AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients

Review of paper by Juntang Zhuang (Yale University), Tommy Tang (University of Illinois at Urbana-Champaign), Yifan Ding (University of Central Florida), et al., 2020

This paper develops a new neural network optimizer that aims to combine the fast convergence and stability of adaptive methods such as Adam with the generalization power of SGD.
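
To make the idea of adapting the step size by the "belief" in the observed gradient concrete, here is a minimal sketch of a single update for one scalar parameter. It follows the update rule as described in the AdaBelief paper: the moving average of the gradient is treated as a prediction, and the step shrinks when the observed gradient deviates from it. The function name, the toy objective f(x) = x^2, and the hyperparameter values are assumptions chosen for illustration, not the authors' reference implementation.

```python
import math


def adabelief_step(x, g, m, s, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaBelief-style update for a scalar parameter x (step count t >= 1)."""
    # Moving average of the gradient: the "predicted" gradient.
    m = beta1 * m + (1 - beta1) * g
    # Moving average of the squared deviation of the gradient from its
    # prediction (Adam would track g ** 2 here instead).
    s = beta2 * s + (1 - beta2) * (g - m) ** 2 + eps
    # Bias correction, as in Adam.
    m_hat = m / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    # Large deviation from the prediction -> small step, and vice versa.
    x = x - lr * m_hat / (math.sqrt(s_hat) + eps)
    return x, m, s


# Toy usage (hypothetical objective): a few steps on f(x) = x ** 2.
x, m, s = 3.0, 0.0, 0.0
for t in range(1, 4):
    x, m, s = adabelief_step(x, 2 * x, m, s, t)
```

The only change relative to an Adam-style step is the second-moment term, which is why the method is often described as a small modification of Adam.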


Attention Augmented Differentiable Forest for Tabular Data

Review of paper by Yingshi Chen, Xiamen University, 2020

The author has developed a new “differentiable forest” neural network framework for predictions on tabular data. It resembles the recently proposed NODE architecture and adds squeeze-and-excitation “tree attention blocks” (TABs); the paper reports performance superior to gradient boosted decision trees (e.g. XGBoost, LightGBM, CatBoost) on a number of benchmarks.


MEAL V2: Boosting Vanilla ResNet-50 to 80%+ Top-1 Accuracy on ImageNet without Tricks

Review of paper by Zhiqiang Shen and Marios Savvides, Carnegie Mellon University, 2020

The authors use a version of the recently proposed MEAL technique, which distills knowledge from multiple large teacher networks into a smaller student network via adversarial learning, to raise the top-1 accuracy of ResNet-50 on ImageNet with 224×224 input size to 80.67% without external training data or network architecture modifications.
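
As a rough illustration of distilling several teachers into one student, the sketch below computes a soft-label loss: the teachers' class probabilities are averaged and the student is penalized by the KL divergence from that average. It covers only the distillation term; the adversarial (discriminator) component mentioned above is omitted. Averaging the teacher probabilities, and all function names, are assumptions made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_soft_labels(teacher_logits):
    """Average the class probabilities of several teachers (assumed combination)."""
    probs = [softmax(logits) for logits in teacher_logits]
    n_classes = len(probs[0])
    return [sum(p[k] for p in probs) / len(probs) for k in range(n_classes)]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pk * math.log((pk + eps) / (qk + eps))
               for pk, qk in zip(p, q) if pk > 0)


def distillation_loss(teacher_logits, student_logits):
    """KL divergence from the teachers' averaged soft labels to the student."""
    target = ensemble_soft_labels(teacher_logits)
    return kl_divergence(target, softmax(student_logits))
```

The loss is zero when the student reproduces the averaged teacher distribution and grows as the two diverge, so no ground-truth labels are needed for this term.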


Big Bird: Transformers for Longer Sequences

Review of paper by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, et al., Google Research, 2020

In this paper, the authors present a Transformer attention model with linear complexity that is mathematically proven to be Turing complete (and thus as powerful as the original quadratic attention model) and achieves new state-of-the-art results on many NLP tasks involving long sequences (e.g. question answering and summarization), as well as genomics data.
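
The linear complexity comes from letting each token attend to only a bounded number of positions. The sketch below builds such a pattern from the three ingredients the Big Bird paper describes: a sliding window, a few global tokens, and a few random keys per token. It works token by token for readability (the paper's implementation operates on blocks), and the function names and parameter values are assumptions made for this illustration.

```python
import random


def sparse_attention_pattern(n, window=1, n_global=1, n_random=1, seed=0):
    """For each of n query positions, return the set of key positions it attends to."""
    rng = random.Random(seed)
    global_tokens = set(range(min(n_global, n)))  # assume the first tokens are global
    pattern = []
    for i in range(n):
        if i in global_tokens:
            # Global tokens attend to every position.
            pattern.append(set(range(n)))
            continue
        # Sliding window around position i.
        keys = set(range(max(0, i - window), min(n, i + window + 1)))
        # Every token attends to the global tokens.
        keys |= global_tokens
        # A few randomly chosen keys.
        keys |= set(rng.sample(range(n), min(n_random, n)))
        pattern.append(keys)
    return pattern


def count_attended_pairs(pattern):
    """Total number of (query, key) pairs that are actually computed."""
    return sum(len(keys) for keys in pattern)
```

With fixed `window`, `n_global`, and `n_random`, each non-global row holds at most `2 * window + 1 + n_global + n_random` keys, so the total count grows linearly in `n` rather than as `n * n`.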

