Real-time object detection is crucial for real-world applications, as it demands high accuracy at low latency. While Detection Transformers (DETR) have demonstrated significant performance improvements, current real-time DETR models are difficult to reproduce from scratch due to the excessive pre-training overhead of their backbones, which constrains research progress by hindering the exploration of novel backbone architectures. In this paper, we show that with generally good design, it is possible to achieve \textbf{high performance} at \textbf{low pre-training cost}. After a thorough study of backbone architectures, we propose EfficientNAT at various scales, which incorporates modern efficient convolutions and local attention mechanisms. Moreover, we redesign the hybrid encoder with local attention, significantly enhancing both performance and inference speed. Building on these advances, we present Le-DETR (\textbf{L}ow-cost and \textbf{E}fficient \textbf{DE}tection \textbf{TR}ansformer), which achieves a new \textbf{SOTA} in real-time detection using only the ImageNet1K and COCO2017 training datasets, saving about 80\% of the images used in the pre-training stage compared with previous methods. We demonstrate that, with careful design, real-time DETR models can achieve strong performance without complex and computationally expensive pre-training. Extensive experiments show that Le-DETR-M/L/X achieves \textbf{52.9/54.3/55.1 mAP} on COCO Val2017 with \textbf{4.45/5.01/6.68 ms} latency on an RTX4090. It surpasses YOLOv12-L/X by \textbf{+0.6/-0.1 mAP}, at similar speed and with a \textbf{+20\%} speedup, respectively. Compared with DEIM-D-FINE, Le-DETR-M achieves \textbf{+0.2 mAP} with slightly faster inference, and surpasses DEIM-D-FINE-L by \textbf{+0.4 mAP} with only \textbf{0.4 ms} of additional latency. Code and weights will be open-sourced.