The recently proposed DEtection TRansformer (DETR) has established a fully end-to-end paradigm for object detection. However, DETR suffers from slow training convergence, which hinders its applicability to various detection tasks. We observe that DETR's slow convergence is largely attributable to the difficulty of matching object queries to relevant regions, which stems from the unaligned semantics between object queries and encoded image features. Based on this observation, we design Semantic-Aligned-Matching DETR++ (SAM-DETR++) to accelerate DETR's convergence and improve detection performance. The core of SAM-DETR++ is a plug-and-play module that projects object queries and encoded image features into the same feature embedding space, where each object query can be easily matched to relevant regions with similar semantics. In addition, SAM-DETR++ searches for multiple representative keypoints and exploits their features for semantic-aligned matching with enhanced representation capacity. Furthermore, building on the designed semantic-aligned matching, SAM-DETR++ can effectively fuse multi-scale features in a coarse-to-fine manner. Extensive experiments show that the proposed SAM-DETR++ achieves superior convergence speed and competitive detection accuracy. Additionally, as a plug-and-play method, SAM-DETR++ can complement existing DETR convergence solutions and yield even better performance, achieving 44.8% AP with merely 12 training epochs and 49.1% AP with 50 training epochs on COCO val2017 with ResNet-50. Code is available at https://github.com/ZhangGongjie/SAM-DETR.
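The matching intuition can be illustrated with a toy sketch. This is not the SAM-DETR++ implementation: the one-hot "encoded features", the mean pooling used to build aligned queries, the cyclic coordinate shift that stands in for a query living in a different embedding space, and all names below are assumptions made only for illustration. The sketch computes scaled dot-product matching weights and measures how much weight each query places on its intended region.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matching_weights(queries, keys):
    """Scaled dot-product matching: one row of weights per query."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [softmax([dot(q, k) * scale for k in keys]) for q in queries]

def mean_pool(features, region):
    """Average the feature vectors at the given positions."""
    dim = len(features[0])
    return [sum(features[i][d] for i in region) / len(region) for d in range(dim)]

def cyclic_shift(vec, offset):
    """Express the same content in a shifted basis (toy stand-in for misalignment)."""
    n = len(vec)
    return [vec[(d - offset) % n] for d in range(n)]

def mass_on_region(weights, region):
    return sum(weights[i] for i in region)

NUM_POSITIONS = 12
# Hypothetical encoded image features: one scaled one-hot vector per position.
features = [[4.0 if d == i else 0.0 for d in range(NUM_POSITIONS)]
            for i in range(NUM_POSITIONS)]
# Hypothetical target regions (lists of position indices), one per query.
regions = [[0, 1], [5, 6], [10, 11]]

# Aligned queries share the embedding space of the features they must match.
aligned_queries = [mean_pool(features, r) for r in regions]
# Misaligned queries carry the same content in a different basis.
misaligned_queries = [cyclic_shift(q, 3) for q in aligned_queries]

aligned_weights = matching_weights(aligned_queries, features)
misaligned_weights = matching_weights(misaligned_queries, features)

aligned_mass = [mass_on_region(w, r) for w, r in zip(aligned_weights, regions)]
misaligned_mass = [mass_on_region(w, r) for w, r in zip(misaligned_weights, regions)]

print("aligned:   ", [round(m, 3) for m in aligned_mass])
print("misaligned:", [round(m, 3) for m in misaligned_mass])
```

In this toy setting, aligned queries place most of their matching weight on their own regions, while the shifted queries place most of it elsewhere. It is a simplified picture of why a shared embedding space makes matching easier, not a reproduction of the paper's module or results.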