DEYOv3: DETR with YOLO for Real-time Object Detection

Recently, end-to-end object detectors have gained significant attention from the research community due to their outstanding performance. However, DETR typically relies on supervised pretraining of the backbone on ImageNet, which limits DETR's practical application, constrains the design of the backbone, and affects the model's potential generalization ability. In this paper, we propose a new training method called step-by-step training. In the first stage, a YOLO detector pre-trained with one-to-many matching is used to initialize the end-to-end detector. In the second stage, the backbone and encoder are kept consistent with the DETR-like model, and only the detector needs to be trained from scratch. With this training method, the object detector does not need an additional dataset (ImageNet) to train the backbone. This makes the design of the backbone more flexible and dramatically reduces the training cost of the detector, both of which help the practical application of the object detector. Step-by-step training also achieves higher accuracy than the traditional training method for DETR-like models. Building on this training method, we propose a new end-to-end real-time object detection model called DEYOv3. DEYOv3-N achieves 41.1% AP on COCO val2017 and 270 FPS on a T4 GPU, while DEYOv3-L achieves 51.3% AP and 102 FPS. Without the use of additional training data, DEYOv3 surpasses all existing real-time object detectors in both speed and accuracy. Notably, models at the N, S, and M scales can be trained on the COCO dataset using a single 24GB RTX3090 GPU.
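The two-stage idea described in the abstract can be illustrated with a minimal PyTorch sketch. It is not the DEYOv3 implementation: the module names (TinyBackbone, YoloLikeDetector, EndToEndDetector), the helper init_from_stage_one, the toy layer sizes, and the learning rate are all hypothetical. Freezing the transferred backbone is one possible reading of "only the detector needs to be trained from scratch", not something the abstract states.

```python
import torch
from torch import nn


class TinyBackbone(nn.Module):
    """Hypothetical stand-in for the feature extractor shared by both stages."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return torch.relu(self.conv(x))


class YoloLikeDetector(nn.Module):
    """Stage 1 model: backbone plus a dense (one-to-many style) prediction head."""

    def __init__(self, num_outputs=6):
        super().__init__()
        self.backbone = TinyBackbone()
        self.dense_head = nn.Conv2d(8, num_outputs, kernel_size=1)

    def forward(self, x):
        return self.dense_head(self.backbone(x))


class EndToEndDetector(nn.Module):
    """Stage 2 model: same backbone layout, new query-based head."""

    def __init__(self, num_queries=4, num_outputs=6):
        super().__init__()
        self.backbone = TinyBackbone()
        self.queries = nn.Parameter(torch.randn(num_queries, 8))
        self.head = nn.Linear(8, num_outputs)

    def forward(self, x):
        feats = self.backbone(x).flatten(2).transpose(1, 2)  # (B, HW, C)
        # Each query attends over all spatial positions.
        attn = torch.softmax(self.queries @ feats.transpose(1, 2), dim=-1)  # (B, Q, HW)
        return self.head(attn @ feats)  # (B, Q, num_outputs)


def init_from_stage_one(stage_one, stage_two, freeze=True):
    """Copy stage-1 backbone weights into the stage-2 model.

    Returns the parameters that remain trainable. Freezing is an assumption
    made for this sketch.
    """
    stage_two.backbone.load_state_dict(stage_one.backbone.state_dict())
    if freeze:
        stage_two.backbone.requires_grad_(False)
    return [p for p in stage_two.parameters() if p.requires_grad]


torch.manual_seed(0)
yolo = YoloLikeDetector()   # assumed to be already trained in stage 1
detr = EndToEndDetector()   # head and queries start from random initialization
trainable = init_from_stage_one(yolo, detr)
optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # optimizes only the new head
out = detr(torch.randn(2, 3, 32, 32))
```

In this sketch the optimizer only sees the queries and the head, so the cost of stage 2 is dominated by the new detector rather than the backbone, which mirrors the training-cost argument in the abstract.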
