The authors suggest a new ResNet-like network architecture that incorporates attention across groups of feature maps. Compared to previous attention models such as SENet and SKNet, the new attention block applies the squeeze-and-attention operation separately to each of the selected groups, which is done in a computationally efficient way and implemented in a simple modular structure.
What can we learn from this paper?
This paper presents an improvement over the SENet architecture, resulting in noticeably better performance on several popular benchmarks.
Prerequisites (to understand the paper, what does one need to be familiar with?)
- Residual networks
- Attention networks
The goal of the paper is to develop a more efficient modular implementation of residual networks with attention.
Despite the availability of more advanced networks, especially those derived via Neural Architecture Search (NAS), the standard ResNet architecture remains the most popular backbone for downstream computer vision applications such as object detection and semantic segmentation. This is largely due to its simplicity and efficiency, which make it easy to train on regular GPUs.
To improve the performance of ResNet models, the recent SENet and SKNet variants have incorporated attention mechanisms applied across various feature maps. The current paper describes ResNeSt (where the letter S stands for Split Attention). In this implementation, all features are divided into K groups, where K is the cardinality hyperparameter. The radix hyperparameter R indicates the number of splits within a cardinal group.
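To make the roles of K and R concrete, here is a minimal NumPy sketch of how a feature tensor could be viewed as K cardinal groups of R splits each. It assumes a channel-contiguous layout and a plain reshape; the function name `partition_features` is hypothetical, and in the actual network each split would come from its own learned transformation rather than from slicing a single tensor.

```python
import numpy as np

def partition_features(x, K, R):
    """View a (K*R*c, H, W) tensor as (K, R, c, H, W).

    K is the cardinality (number of cardinal groups) and R is the radix
    (number of splits inside each group). The contiguous channel layout
    is an assumption made for illustration only.
    """
    channels, H, W = x.shape
    if channels % (K * R) != 0:
        raise ValueError("channel count must be divisible by K * R")
    c = channels // (K * R)  # channels per split
    return x.reshape(K, R, c, H, W)

# 24 channels, K = 2 groups, R = 3 splits -> 4 channels per split
x = np.arange(24, dtype=float).reshape(24, 1, 1)
groups = partition_features(x, K=2, R=3)
print(groups.shape)  # (2, 3, 4, 1, 1)
```

With K = 2 and R = 3, the first cardinal group holds channels 0 to 11 and its second split holds channels 4 to 7.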
Within each cardinal group, the network computes attention weights across all R splits, multiplies each split's feature maps by the corresponding weights, and sums the weighted splits to produce the group's output representation. The K group representations are then concatenated along the channel dimension, and the result is added to the block input through a shortcut connection, as in a regular ResNet.
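The computation described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: the attention weights are assumed to come from global average pooling followed by two dense layers and a softmax across splits, the weight matrices `w1` and `w2` are random placeholders, and all function names are hypothetical. Convolutions, normalization layers, and any further projection inside the block are omitted.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def split_attention(splits, w1, w2):
    """Fuse the R splits of one cardinal group.

    splits: (R, C, H, W) feature maps, one entry per split
    w1:     (C, D) placeholder weights of the first dense layer
    w2:     (D, R*C) placeholder weights of the second dense layer
    """
    R, C, H, W = splits.shape
    # Squeeze: sum the splits, then average over spatial positions.
    pooled = splits.sum(axis=0).mean(axis=(1, 2))      # (C,)
    hidden = np.maximum(pooled @ w1, 0.0)              # (D,), ReLU
    logits = (hidden @ w2).reshape(R, C)               # one logit per split and channel
    attn = softmax(logits, axis=0)                     # weights sum to 1 across splits
    # Weight each split channel-wise and add the splits together.
    return (attn[:, :, None, None] * splits).sum(axis=0)  # (C, H, W)

def block_output(x, group_splits, weights):
    """Concatenate the K group outputs along channels and add the shortcut."""
    outs = [split_attention(s, w1, w2) for s, (w1, w2) in zip(group_splits, weights)]
    return np.concatenate(outs, axis=0) + x

rng = np.random.default_rng(0)
K, R, C, H, W, D = 2, 3, 4, 5, 5, 8
x = rng.normal(size=(K * C, H, W))                     # block input
group_splits = rng.normal(size=(K, R, C, H, W))        # stand-in for per-split features
weights = [(rng.normal(size=(C, D)), rng.normal(size=(D, R * C))) for _ in range(K)]
y = block_output(x, group_splits, weights)
print(y.shape)  # (8, 5, 5)
```

Because the softmax runs across splits, each output channel is a convex combination of the corresponding channels of the R splits. If the second dense layer outputs all zeros, the weights become uniform and the block reduces to a plain average of the splits.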
The paper discusses various implementation details and presents the results of ResNeSt-50 and ResNeSt-101 on ImageNet, as well as on other datasets such as MS-COCO, Cityscapes, and ADE20K. In all cases, the new architecture shows a significant improvement over ResNet models of the same size (e.g. top-1 accuracy of 81.8% on ImageNet for ResNeSt-50 as opposed to 76.9% for ResNet-50, 80.3% for SENet-50, and 80.7% for SKNet-50) without, according to the authors, introducing additional computational costs.
ResNeSt definitely looks like another promising architecture, and the availability of the source code and pretrained models makes it easy for anyone to try it out.
Original paper link
ResNeSt: Split-Attention Networks by H. Zhang, C. Wu, Z. Zhang et al., 2020
Squeeze-and-Excitation Networks by J. Hu, L. Shen, S. Albanie et al., 2017
Selective Kernel Networks by X. Li, W. Wang, X. Hu, J. Yang, 2019