Modern detection transformers (DETRs) use a set of object queries to predict a list of bounding boxes, sort them by classification confidence, and select the top-ranked predictions as the final detection results for the input image. A high-performing object detector therefore requires accurate ranking of its bounding box predictions. In DETR-based detectors, however, the top-ranked bounding boxes are often poorly localized because classification scores are misaligned with localization accuracy, which impedes the construction of high-quality detectors. In this work, we introduce a simple and highly performant DETR-based object detector by proposing a series of rank-oriented designs, collectively called Rank-DETR. Our key contributions are: (i) a rank-oriented architecture design that promotes positive predictions and suppresses negative ones to lower the false positive rate, and (ii) a rank-oriented loss function and matching cost design that prioritizes predictions with more accurate localization during ranking, boosting the AP under high IoU thresholds. We apply our method to recent SOTA methods (e.g., H-DETR and DINO-DETR) and report strong COCO object detection results with different backbones such as ResNet-$50$, Swin-T, and Swin-L, demonstrating the effectiveness of our approach. Code is available at \url{https://github.com/LeapLabTHU/Rank-DETR}.
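The ranking problem described in the abstract can be illustrated with a small sketch. It is not the Rank-DETR method or its code: the helper names `iou` and `top_k_by_score`, the box format `(x1, y1, x2, y2)`, and the toy scores are all assumptions made for illustration. It only shows how sorting by classification confidence can place a poorly localized box above a well localized one.

```python
from typing import List, Tuple

# Hypothetical box format for this sketch: (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def top_k_by_score(scores: List[float], k: int) -> List[int]:
    """Indices of the k predictions with the highest classification score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy data (made up): one ground-truth box and two predictions.
gt: Box = (0.0, 0.0, 10.0, 10.0)
boxes: List[Box] = [
    (0.0, 0.0, 10.0, 10.0),  # perfectly localized, lower confidence
    (2.0, 2.0, 12.0, 12.0),  # shifted box, higher confidence
]
scores = [0.6, 0.9]

best = top_k_by_score(scores, k=1)[0]
print("top-ranked index:", best)
print("IoU of top-ranked box:", round(iou(gt, boxes[best]), 3))
print("IoU of the other box:", round(iou(gt, boxes[1 - best]), 3))
```

In this toy case the shifted box is ranked first because of its higher score even though its IoU with the ground truth is below 0.5, which is the kind of score-versus-localization misalignment that hurts AP at high IoU thresholds.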