Real-time object detection has advanced rapidly in recent years. The YOLO series is among the most widely used CNN-based object detectors. The latest version, YOLOv26, was recently released, while YOLOv12 achieved state-of-the-art (SOTA) performance with 55.2 mAP on the COCO val2017 dataset. Meanwhile, transformer-based detectors, known as DEtection TRansformers (DETR), have also demonstrated impressive performance. RT-DETR outperformed the YOLO series in both speed and accuracy at the time of its release, and its successor, RT-DETRv2, achieved 53.4 mAP on COCO val2017. However, despite their remarkable performance, all of these models let valuable information slip away: they focus primarily on the features of foreground objects while neglecting the contextual cues provided by the background. We believe that background context can significantly aid object detection. For example, cars are more likely to appear on roads than in offices, while wild animals are more likely to be found in forests or remote areas than on busy streets. To address this gap, we propose Association DETR, an object detection model that achieves state-of-the-art results against competing detectors on the COCO val2017 dataset.