This paper presents a general scheme for enhancing the convergence and performance of DETR (DEtection TRansformer). We investigate the slow convergence problem of transformers from a new perspective, suggesting that it arises from self-attention introducing no structural bias over the inputs. To address this issue, we explore incorporating a position relation prior as an attention bias to augment object detection, after verifying its statistical significance with a proposed quantitative macroscopic correlation (MC) metric. Our approach, termed Relation-DETR, introduces an encoder that constructs position relation embeddings for progressive attention refinement, and further extends the traditional streaming pipeline of DETR into a contrastive relation pipeline to resolve the conflict between non-duplicate predictions and positive supervision. Extensive experiments on both generic and task-specific datasets demonstrate the effectiveness of our approach. Under the same configurations, Relation-DETR achieves a significant improvement (+2.0% AP over DINO), state-of-the-art performance (51.7% AP for the 1x and 52.1% AP for the 2x setting), and remarkably faster convergence (over 40% AP with only 2 training epochs) compared with existing DETR detectors on COCO val2017. Moreover, the proposed relation encoder serves as a universal plug-and-play component, bringing clear improvements to, in principle, any DETR-like method. Furthermore, we introduce a class-agnostic detection dataset, SA-Det-100k. Experimental results on this dataset show that the proposed explicit position relation yields a clear improvement of 1.3% AP, highlighting its potential for universal object detection. The code and dataset are available at https://github.com/xiuqhou/Relation-DETR.
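The abstract describes incorporating a position relation prior as a bias on attention. As a rough, hand-crafted illustration only (the paper's actual relation encoder is learned, and its exact formulation is not given here), the sketch below computes a scale-invariant pairwise position bias from box geometry and adds it to attention logits before the softmax; the function names and the bias formula are hypothetical.

```python
import math

def box_relation_bias(boxes):
    """Pairwise position-relation features -> scalar attention bias.

    Each box is (cx, cy, w, h). Closer boxes of similar size receive a
    larger (less negative) bias. This is an illustrative hand-crafted
    prior, not the learned relation embedding used in Relation-DETR.
    """
    n = len(boxes)
    bias = [[0.0] * n for _ in range(n)]
    for i, (cxi, cyi, wi, hi) in enumerate(boxes):
        for j, (cxj, cyj, wj, hj) in enumerate(boxes):
            # log-scaled relative offsets and size ratios (translation/scale invariant)
            dx = math.log(abs(cxi - cxj) / wi + 1.0)
            dy = math.log(abs(cyi - cyj) / hi + 1.0)
            dw = abs(math.log(wj / wi))
            dh = abs(math.log(hj / hi))
            bias[i][j] = -(dx + dy + dw + dh)
    return bias

def attention_with_bias(scores, bias):
    """Add the relation bias to attention logits, then row-wise softmax."""
    out = []
    for row_s, row_b in zip(scores, bias):
        logits = [s + b for s, b in zip(row_s, row_b)]
        m = max(logits)
        exps = [math.exp(v - m) for v in logits]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out
```

With uniform content scores, the bias alone steers attention toward nearby, similarly sized boxes, which is the structural prior the abstract argues plain self-attention lacks.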