The authors used a version of the recently proposed MEAL technique (knowledge distillation from an ensemble of large teacher networks into a smaller student network via adversarial learning) to raise the top-1 accuracy of ResNet-50 on ImageNet with 224×224 inputs to 80.67%, without external training data or modifications to the network architecture.
What can we learn from this paper?
That even a relatively small network can be trained to achieve the accuracy of much larger networks with the right approach.
In a way, this is not surprising: modern deep neural networks are deliberately overparameterized to take advantage of the multitude of randomly initialized subnetworks, as described in “The Lottery Ticket Hypothesis” paper. It therefore makes sense that a smaller network would be sufficient to achieve similar performance, but it is really nice to see this demonstrated in practice.
Prerequisites (to better understand the paper, what should one be familiar with?)
- Knowledge distillation
- Adversarial training
The technique of ensembling, or ensemble learning, which combines the predictions of multiple ML models, is a well-known way to improve predictive accuracy. It is widely used in Kaggle competitions, where achieving the best accuracy, even at the expense of a huge computational load, is the priority. In most practical applications, however, ensembling is avoided because of the cost and time required to run every model at prediction time.
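The simplest and most common form of ensembling is averaging the per-class probabilities of several models; a minimal sketch (the function name and the toy probability vectors are mine, for illustration only):

```python
import numpy as np

def ensemble_predict(prob_list):
    """Average the softmax probability vectors produced by several models."""
    return np.mean(prob_list, axis=0)

# Hypothetical per-class probabilities from three models for one image
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.6, 0.3, 0.1])
p3 = np.array([0.5, 0.1, 0.4])

avg = ensemble_predict([p1, p2, p3])
print(avg)           # [0.6 0.2 0.2]
print(avg.argmax())  # predicted class: 0
```

The catch, as noted above, is that every model in the ensemble must be evaluated for every prediction, which is exactly the cost that distillation tries to eliminate.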
The idea of the MEAL technique, proposed in a recent article by the same first author, is to distill knowledge from multiple large neural networks (teachers) into a smaller student network, thus creating a computationally efficient model that retains the benefits of ensembling. The student network is trained together with an additional discriminator network, which tries to distinguish the outputs generated by the student's layers from the corresponding outputs of a teacher network for the same input.
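The adversarial part follows the standard GAN-style setup: the discriminator is trained to output 1 for teacher features and 0 for student features, while the student is trained to fool it. A minimal sketch with binary cross-entropy losses (the function names and probability values are mine, not the paper's):

```python
import numpy as np

def discriminator_loss(p_teacher, p_student, eps=1e-12):
    """Binary cross-entropy: the discriminator wants its predicted
    'came from a teacher' probability to be 1 on teacher outputs
    and 0 on student outputs."""
    return -(np.log(p_teacher + eps) + np.log(1.0 - p_student + eps))

def student_adversarial_loss(p_student, eps=1e-12):
    """The student wants the discriminator to be fooled,
    i.e. to output 1 on the student's features."""
    return -np.log(p_student + eps)

# If the discriminator confidently separates teacher (0.9) from
# student (0.1), its own loss is low, while the student's adversarial
# loss is high -- pushing the student to produce teacher-like outputs.
print(discriminator_loss(0.9, 0.1))   # ≈ 0.21
print(student_adversarial_loss(0.1))  # ≈ 2.30
print(student_adversarial_loss(0.9))  # ≈ 0.11
```

As the student's outputs become indistinguishable from the teacher's, the discriminator can no longer separate them, which is exactly the training signal the method exploits.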
In this paper, the authors simplify the MEAL technique in two ways: the similarity (KL-divergence) loss is applied only to the final output layer of each network, not to the intermediate layers, and the distillation target is the average of the softmax probabilities of all teacher networks together, rather than a single teacher sampled at each training step.
Using this approach, the authors trained the vanilla ResNet-50 architecture on ImageNet with no modifications, external training data, or tricks such as AutoAugment, mixup, or label smoothing, reaching a top-1 accuracy of 80.67% with 224×224 input images — by far the best result to date for this architecture. The paper shows that this accuracy can be improved further by using larger 380×380 images (81.72%) or additional data augmentation such as CutMix (80.98%).
Even for the smaller MobileNet V3 Small and EfficientNet-B0 models, the suggested training technique improved the original ImageNet top-1 accuracy by about 2.2%, showing the potential to train even smaller models to nearly match the accuracy of large networks — a free performance improvement at inference time.
Considering how overparameterized most deep neural networks are nowadays (in order to find a better model by taking advantage of “The Lottery Ticket Hypothesis”), this seems like a really neat way to scale the resulting models back down while preserving accuracy. Hopefully, the feasibility of this approach for practical tasks will be confirmed by further research.