End-to-End Object Detection with Transformers

Review of paper by Nicolas Carion, Francisco Massa, Gabriel Synnaeve et al., Facebook AI Research, 2020

This paper describes an end-to-end object detection system that combines a convolutional network with a Transformer. The new model performs on par with Faster R-CNN and can be extended to other tasks such as panoptic segmentation.

What can we learn from this paper?

That Transformer architectures can be effectively used for object detection tasks.

Prerequisites (to understand the paper, what does one need to be familiar with?)

  • Object detection
  • Convolutional networks
  • Transformer networks


Object detection is one of the most common tasks in computer vision, and many research papers and deep learning architectures are devoted to it. In the popular Faster R-CNN architecture and its variants, a first-stage network starts from a set of initial anchor regions and proposes bounding boxes that may contain objects. These proposals are pruned with non-maximum suppression and passed to the second stage for classification. Both stages share the same convolutional feature maps, which contributes to the efficiency of the overall system. Still, the overall architecture is large and complex, which has motivated less accurate but smaller and faster alternatives such as YOLO and SSD that are used in many practical applications.
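To make the non-maximum suppression step mentioned above concrete, here is a minimal pure-Python sketch of greedy NMS. The box format (x1, y1, x2, y2), the 0.5 threshold and the function names are illustrative assumptions, not details taken from the paper or from any particular Faster R-CNN implementation.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical proposals: the first two overlap heavily, the third is separate.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # indices of the boxes that survive
```

This hand-tuned, non-differentiable post-processing is exactly the kind of component that a direct set prediction approach tries to avoid.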

In this paper, the authors propose a different deep learning architecture named DETR (DEtection TRansformer), which directly predicts the set of objects in each image. Training relies on a set-based loss. The Hungarian algorithm first finds a one-to-one matching between ground truth objects and predicted objects. The loss is then computed over the matched pairs as the sum of a negative log-likelihood term for classification and a combination of L1 (absolute difference) and generalized Intersection over Union (IoU) terms for the bounding box coordinates.
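The matching-based loss can be sketched in a few lines of pure Python. This is a simplified illustration under stated assumptions, not the authors' implementation: boxes are (x1, y1, x2, y2) with positive area, the weights `w_l1` and `w_giou` are illustrative, the last class index stands for "no object" and is used for unmatched predictions, and the optimal assignment is found by brute force over permutations as a stand-in for the Hungarian algorithm (practical only for a handful of boxes).

```python
import math
from itertools import permutations


def giou(a, b):
    """Generalized IoU of two (x1, y1, x2, y2) boxes with positive area."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Area of the smallest box enclosing both a and b.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull


def box_loss(p, t, w_l1=5.0, w_giou=2.0):
    """Weighted L1 + generalized IoU loss (weights are illustrative)."""
    l1 = sum(abs(x - y) for x, y in zip(p, t))
    return w_l1 * l1 + w_giou * (1.0 - giou(p, t))


def match(preds, targets):
    """For each target, the index of its prediction in the cheapest one-to-one
    assignment. Brute force stand-in for the Hungarian algorithm."""
    def total(perm):
        return sum(-preds[p]["probs"][t["label"]] + box_loss(preds[p]["box"], t["box"])
                   for p, t in zip(perm, targets))
    return list(min(permutations(range(len(preds)), len(targets)), key=total))


def set_loss(preds, targets, no_object):
    """Negative log-likelihood + box loss over matched pairs; unmatched
    predictions are scored against the no_object class."""
    matched = dict(zip(match(preds, targets), targets))
    loss = 0.0
    for i, pred in enumerate(preds):
        if i in matched:
            t = matched[i]
            loss += -math.log(pred["probs"][t["label"]]) + box_loss(pred["box"], t["box"])
        else:
            loss += -math.log(pred["probs"][no_object])
    return loss


# Hypothetical example: 3 predictions, 2 ground truth objects, class 2 = no object.
preds = [
    {"probs": [0.10, 0.80, 0.10], "box": (0.5, 0.5, 0.9, 0.9)},
    {"probs": [0.70, 0.20, 0.10], "box": (0.1, 0.1, 0.4, 0.4)},
    {"probs": [0.05, 0.05, 0.90], "box": (0.0, 0.0, 0.2, 0.2)},
]
targets = [
    {"label": 0, "box": (0.1, 0.1, 0.4, 0.5)},
    {"label": 1, "box": (0.5, 0.5, 0.9, 0.9)},
]
assignment = match(preds, targets)  # prediction index for each target
loss = set_loss(preds, targets, no_object=2)
```

Because the assignment is one-to-one, two predictions cannot both take credit for the same ground truth object, so duplicates are penalized during training rather than filtered out afterwards.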

The architecture of DETR consists of a backbone CNN such as an ImageNet-pretrained ResNet, followed by a full encoder-decoder Transformer and a simple feedforward network that produces the final detection predictions. The Transformer's attention mechanism lets the model account for relationships between objects and draw on the global context of the image.
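One way to picture how data flows through these three parts is to trace tensor shapes. The sketch below does only that; it is not a model. The default values (a backbone with total stride 32 and 2048 output channels, a Transformer width of 256, and 100 object queries) are assumptions chosen to resemble a ResNet-50 style configuration and are not stated in this review. The function name `detr_shapes` is hypothetical.

```python
import math


def detr_shapes(height, width, num_classes, backbone_stride=32,
                backbone_channels=2048, d_model=256, num_queries=100):
    """Trace illustrative tensor shapes for one image through a DETR-style
    pipeline: backbone CNN -> Transformer encoder-decoder -> prediction heads."""
    # The backbone downsamples the image by its total stride.
    h = math.ceil(height / backbone_stride)
    w = math.ceil(width / backbone_stride)
    return {
        "image": (3, height, width),
        "backbone_features": (backbone_channels, h, w),
        # A 1x1 convolution reduces the channel count to the Transformer width.
        "projected_features": (d_model, h, w),
        # The spatial grid is flattened into a sequence for the encoder.
        "encoder_sequence": (h * w, d_model),
        # The decoder emits one embedding per learned object query.
        "decoder_output": (num_queries, d_model),
        # Feedforward heads: class scores (plus one "no object" class) and boxes.
        "class_logits": (num_queries, num_classes + 1),
        "boxes": (num_queries, 4),
    }


# Hypothetical input size; 80 is the number of COCO object categories.
shapes = detr_shapes(800, 1216, num_classes=80)
for name, shape in shapes.items():
    print(f"{name:20s} {shape}")
```

The point to notice is that the number of predictions is fixed by the number of object queries, independent of the image size, which is why a "no object" class is needed for the unused slots.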

The resulting network was evaluated on the COCO 2017 dataset. Against the Faster R-CNN baseline, the new model achieved slightly superior performance with a similar number of parameters, or similar performance when a smaller and faster backbone network was used.

By adding a mask head on top of the decoder outputs, DETR can also be used for panoptic segmentation (a combination of semantic and instance segmentation). On the COCO dataset, the new approach achieved close to the state of the art (very few papers and results are available so far for this newer task).

While the performance of DETR as reported in the paper was noticeably lower than the state of the art in object detection at the time, its conceptual simplicity and end-to-end design, together with the room for improvement in what is only a first step, make it a very interesting new approach.

Original paper link

Authors’ review

Github repository

Suggested further reading
