Recently, end-to-end transformer-based detectors (DETRs) have achieved remarkable performance. However, their high computational cost has not been effectively addressed, which limits their practical application and prevents them from fully exploiting the benefit of requiring no post-processing, such as non-maximum suppression (NMS). In this paper, we first analyze the influence of NMS on the inference speed of modern real-time object detectors and establish an end-to-end speed benchmark. To avoid the inference delay caused by NMS, we propose a Real-Time DEtection TRansformer (RT-DETR), which is, to the best of our knowledge, the first real-time end-to-end object detector. Specifically, we design an efficient hybrid encoder that processes multi-scale features by decoupling intra-scale interaction from cross-scale fusion, and we propose IoU-aware query selection to improve the initialization of object queries. In addition, our detector supports flexible adjustment of the inference speed by using different decoder layers without retraining, which facilitates the practical application of real-time object detectors. Our RT-DETR-L achieves 53.0% AP on COCO val2017 and 114 FPS on a T4 GPU, while RT-DETR-X achieves 54.8% AP and 74 FPS, outperforming all YOLO detectors of the same scale in both speed and accuracy. Furthermore, our RT-DETR-R50 achieves 53.1% AP and 108 FPS, outperforming DINO-Deformable-DETR-R50 by 2.2% AP in accuracy and by about 21 times in FPS. Source code and pre-trained models are available at https://github.com/lyuwenyu/RT-DETR.
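For readers unfamiliar with the post-processing step that the abstract refers to, the following is a minimal sketch of greedy NMS in plain Python. It is not taken from the RT-DETR code base; the function names `iou` and `nms`, the `(x1, y1, x2, y2)` box format, and the default thresholds are assumptions chosen for illustration. It shows why NMS adds latency: each candidate box has to be compared against the boxes already kept, and the amount of work depends on how many candidates pass the score threshold.

```python
from typing import List, Sequence, Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(
    boxes: Sequence[Box],
    scores: Sequence[float],
    iou_threshold: float = 0.5,    # illustrative default, not from the paper
    score_threshold: float = 0.0,  # illustrative default, not from the paper
) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first."""
    # Drop low-confidence candidates, then visit the rest by descending score.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))  # the second box overlaps the first and is dropped
```

Both thresholds are hyperparameters that trade accuracy against speed, which is one reason a detector that needs no NMS step is simpler to benchmark end to end.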