DETR training depends heavily on pre-training the backbone on the ImageNet dataset. However, the limited supervisory signal from the image classification task, combined with the one-to-one matching strategy, leaves the neck of a DETR inadequately pre-trained. In addition, unstable matching in the early stages of training makes the optimization objectives of DETRs inconsistent. To address these issues, we propose a training method called step-by-step training. In the first stage, a classic detector pre-trained with a one-to-many matching strategy is used to initialize the backbone and neck of the end-to-end detector. In the second stage, we freeze the backbone and neck of the end-to-end detector and train only the decoder, from scratch. Using step-by-step training, we introduce DETR with YOLO (DEYO), the first real-time end-to-end object detection model with a purely convolutional encoder. Without relying on any additional training data, DEYO surpasses all existing real-time object detectors in both speed and accuracy. Moreover, the entire DEYO series can complete its second-stage training on the COCO dataset using a single 8GB RTX 4060 GPU, which significantly reduces training cost. Source code and pre-trained models are available at https://github.com/ouyanghaodong/DEYO.
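The second-stage setup described above (initialize the backbone and neck from the stage-one detector, freeze them, and train the decoder from scratch) can be sketched in a framework-agnostic way. The following is a minimal illustration, not the DEYO implementation: the `Param` class, the `init_and_freeze` helper, and the parameter names (`backbone.conv1`, `neck.fuse`, `decoder.layer0`, `head.cls`) are all hypothetical, and plain Python lists stand in for weight tensors.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """Stand-in for a framework tensor: a value plus a trainable flag."""
    value: list
    trainable: bool = True


def init_and_freeze(detector, pretrained, frozen_prefixes=("backbone.", "neck.")):
    """Stage-two setup (assumed layout): copy stage-one weights into the
    backbone and neck, freeze them, and leave every other parameter
    (here, the decoder) at its fresh initialization and trainable."""
    for name, param in detector.items():
        if name.startswith(frozen_prefixes):
            if name not in pretrained:
                raise KeyError(f"missing pre-trained weight for {name}")
            param.value = list(pretrained[name])  # copy, do not alias
            param.trainable = False
    # Only these parameters would be handed to the optimizer.
    return [name for name, p in detector.items() if p.trainable]


# Hypothetical parameter names; a real model has many more.
pretrained = {
    "backbone.conv1": [0.5, -0.2],
    "neck.fuse": [0.1, 0.3],
    "head.cls": [0.9],  # one-to-many detection head: not transferred
}
detector = {
    "backbone.conv1": Param([0.0, 0.0]),
    "neck.fuse": Param([0.0, 0.0]),
    "decoder.layer0": Param([0.01]),  # freshly initialized, trained from scratch
}

trainable_names = init_and_freeze(detector, pretrained)
```

In a framework such as PyTorch, the analogous steps would be loading a filtered state dict into the model and setting `requires_grad = False` on the backbone and neck parameters, so that only decoder parameters receive gradient updates.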