The authors used a variant of the recently proposed MEAL technique, which distills knowledge from multiple large teacher networks into a smaller student network via adversarial learning, to raise the top-1 accuracy of ResNet-50 on ImageNet at 224×224 input size to 80.67%, without external training data or modifications to the network architecture.
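As a rough illustration of the distillation part of this idea, the sketch below computes a per-example loss between a student's prediction and the averaged soft labels of several teachers. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the simple averaging of teacher outputs, and the `temperature` parameter are all hypothetical, and the adversarial (discriminator) component mentioned above is omitted entirely.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_soft_labels(teacher_logits, temperature=1.0):
    """Average the softmax outputs of several teachers for one input.

    Assumption: teachers are combined by a plain mean of probabilities.
    """
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    n = len(probs)
    return [sum(p[c] for p in probs) / n for c in range(len(probs[0]))]


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence from the teacher ensemble to the student, one example."""
    target = ensemble_soft_labels(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(target, student) if t > 0.0)
```

The loss is zero when the student reproduces the ensemble distribution exactly and grows as the two distributions diverge, which is what pushes the smaller network toward the teachers' soft predictions.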
The authors applied contrastive loss, which has recently proven very effective for learning deep neural network representations in the self-supervised setting, to supervised learning, and achieved better results than those obtained with the cross-entropy loss for ResNet-50 and ResNet-200.
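To make the idea of a contrastive loss driven by labels concrete, here is a minimal sketch in plain Python. It assumes one common formulation: embeddings are L2-normalized, every other sample with the same label counts as a positive for the anchor, and the per-anchor losses are averaged. The function names, the `temperature` default, and the averaging convention are assumptions for illustration and may differ from what the authors used.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Label-aware contrastive loss over a small batch (illustrative only)."""
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        # Positives: all other samples sharing the anchor's label.
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Scaled cosine similarity between the anchor and every other sample.
        sims = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n)
            if a != i
        }
        # Stable log-sum-exp over all non-anchor samples (the denominator).
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Mean negative log-probability of the positives.
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0
```

With this formulation, a batch whose same-label embeddings lie close together yields a lower loss than the same embeddings with labels that cut across the clusters.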
The authors propose a new ResNet-like network architecture that incorporates attention across groups of feature maps. Compared to previous attention models such as SENet and SKNet, the new attention block applies the squeeze-and-attention operation separately to each of the selected groups, in a computationally efficient way and within a simple modular structure.
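The following is a loose sketch of what attention across feature-map splits within one such group could look like. It is an illustration under assumptions, not the authors' block: the summed splits are globally pooled into one descriptor per channel, a single hypothetical weight tensor `weights` stands in for whatever learned layers produce the attention logits, and a softmax across splits decides how much each split contributes to each output channel.

```python
import math


def split_attention(splits, weights):
    """Combine feature-map splits with per-channel attention (illustrative).

    splits[r][c] is a flat list of spatial activations for split r, channel c.
    weights[r][c][k] is a hypothetical dense weight mapping pooled channel k
    to the attention logit of split r, channel c.
    """
    num_splits = len(splits)
    num_channels = len(splits[0])
    num_positions = len(splits[0][0])

    # Squeeze: sum the splits, then global-average-pool each channel.
    pooled = [
        sum(sum(splits[r][c]) for r in range(num_splits)) / num_positions
        for c in range(num_channels)
    ]

    # One attention logit per (split, channel).
    logits = [
        [
            sum(w * s for w, s in zip(weights[r][c], pooled))
            for c in range(num_channels)
        ]
        for r in range(num_splits)
    ]

    # Softmax across splits, computed separately for every channel.
    attention = [[0.0] * num_channels for _ in range(num_splits)]
    for c in range(num_channels):
        m = max(logits[r][c] for r in range(num_splits))
        exps = [math.exp(logits[r][c] - m) for r in range(num_splits)]
        total = sum(exps)
        for r in range(num_splits):
            attention[r][c] = exps[r] / total

    # Output: attention-weighted sum of the splits at every position.
    output = [
        [
            sum(attention[r][c] * splits[r][c][p] for r in range(num_splits))
            for p in range(num_positions)
        ]
        for c in range(num_channels)
    ]
    return output, attention
```

With all-zero weights the attention is uniform, so the block reduces to a plain average of the splits; learned weights let it emphasize one split over another on a per-channel basis.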
This paper examines the theoretical reasons for using batch normalization in deep residual networks and suggests a simpler alternative solution.