Decoupled DETR: Spatially Disentangling Localization and Classification for Improved End-to-End Object Detection

DETR introduced a new paradigm for object detection. However, its decoder performs classification and box localization with shared queries and shared cross-attention layers, which leads to suboptimal results. We observe that, even for the same object, the two tasks favor different regions of the visual feature map: salient regions provide vital information for classification, while the boundaries around them are more favorable for box regression. This spatial misalignment between the two tasks greatly hinders DETR's training. In this work, we therefore focus on decoupling localization and classification in DETR. To this end, we introduce a new design scheme called spatially decoupled DETR (SD-DETR), which includes a task-aware query generation module and a disentangled feature learning process. We carefully design the task-aware query initialization process and split the cross-attention block in the decoder so that the task-aware queries can match different visual regions. We also observe a prediction misalignment problem between high classification confidence and precise localization, so we propose an alignment loss to further guide the training of SD-DETR. Extensive experiments on the MS COCO dataset demonstrate a significant improvement over previous work; for instance, we improve the performance of Conditional DETR by 4.5 AP. By spatially disentangling the two tasks, our method overcomes the misalignment problem and greatly improves the performance of DETR for object detection.
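The idea of splitting the decoder's cross-attention so that classification queries and localization queries attend to the image features independently can be illustrated with a small NumPy sketch. This is not the SD-DETR implementation: the abstract does not specify layer details, so the single-head attention, the random weights, and every name below (`cross_attention`, `decoupled_decoder_step`, `cls_queries`, `loc_queries`) are assumptions made only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, memory, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    queries: (num_queries, d) task-specific object queries
    memory:  (num_tokens, d) flattened visual feature map
    Returns the attended features and the attention weights.
    """
    q = queries @ w_q
    k = memory @ w_k
    v = memory @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return weights @ v, weights


def decoupled_decoder_step(cls_queries, loc_queries, memory, params):
    """Hypothetical decoupled step: each task has its own queries and its own
    cross-attention weights, so the two tasks can focus on different regions
    of the same feature map."""
    cls_feat, cls_attn = cross_attention(cls_queries, memory, *params["cls"])
    loc_feat, loc_attn = cross_attention(loc_queries, memory, *params["loc"])
    return cls_feat, loc_feat, cls_attn, loc_attn


# Toy inputs (assumed sizes, random values; no trained weights involved).
rng = np.random.default_rng(0)
d, num_queries, num_tokens = 16, 4, 30
memory = rng.normal(size=(num_tokens, d))
cls_queries = rng.normal(size=(num_queries, d))
loc_queries = rng.normal(size=(num_queries, d))
params = {
    task: tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    for task in ("cls", "loc")
}

cls_feat, loc_feat, cls_attn, loc_attn = decoupled_decoder_step(
    cls_queries, loc_queries, memory, params
)
print(cls_feat.shape, loc_feat.shape)  # one feature vector per query and task
```

In a shared-decoder design, a single attention map per query would feed both prediction heads; here `cls_attn` and `loc_attn` are separate maps over the same `memory`, which is the property the abstract attributes to the divided cross-attention block.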
