The recently developed DETR approach applies the transformer encoder-decoder architecture to object detection and achieves promising performance. In this paper, we address a critical issue, slow training convergence, and present a conditional cross-attention mechanism for fast DETR training. Our approach is motivated by the observation that the cross-attention in DETR relies heavily on the content embeddings for localizing the four extremities and predicting the box, which increases the need for high-quality content embeddings and thus the training difficulty. Our approach, named conditional DETR, learns a conditional spatial query from the decoder embedding for decoder multi-head cross-attention. The benefit is that, through the conditional spatial query, each cross-attention head is able to attend to a band containing a distinct region, e.g., one object extremity or a region inside the object box. This narrows down the spatial range for localizing the distinct regions for object classification and box regression, thus relaxing the dependence on the content embeddings and easing the training. Empirical results show that conditional DETR converges 6.7x faster for the backbones R50 and R101 and 10x faster for the stronger backbones DC5-R50 and DC5-R101. Code is available at https://github.com/Atten4Vis/ConditionalDETR.
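The idea of a spatial query conditioned on the decoder embedding can be illustrated with a minimal, single-head sketch in plain Python. It is not the released implementation. The abstract does not specify how the spatial query is formed, so the following are assumptions made for illustration only: a normalized 2D reference point, a sinusoidal positional embedding scaled by 2π, an element-wise product between that embedding and a one-layer linear transformation of the decoder embedding, and attention scores computed as the sum of a content dot product and a spatial dot product. All function names and parameters are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def sinusoidal_embedding(x, y, dim):
    """Sinusoidal embedding of a normalized 2D point (assumes dim % 4 == 0)."""
    half = dim // 2
    out = []
    for coord in (x, y):
        for i in range(half // 2):
            freq = 1.0 / (10000 ** (2 * i / half))
            angle = 2 * math.pi * coord * freq
            out.extend([math.sin(angle), math.cos(angle)])
    return out


def linear(vec, weight):
    """Apply a weight matrix (list of rows) to a vector."""
    return [dot(row, vec) for row in weight]


def conditional_spatial_query(decoder_embedding, ref_point, weight):
    """Assumed form: scale the reference-point embedding element-wise by a
    transformation of the decoder embedding (one linear layer stands in for
    whatever learned transformation the model actually uses)."""
    scale = linear(decoder_embedding, weight)
    p_s = sinusoidal_embedding(ref_point[0], ref_point[1], len(scale))
    return [s * p for s, p in zip(scale, p_s)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_weights(content_query, spatial_query, content_keys, spatial_keys):
    """Attention weights where each score is content dot product plus spatial
    dot product, i.e. the dot product of concatenated [content; spatial] vectors."""
    dim = len(content_query) + len(spatial_query)
    scores = [
        (dot(content_query, c_k) + dot(spatial_query, p_k)) / math.sqrt(dim)
        for c_k, p_k in zip(content_keys, spatial_keys)
    ]
    return softmax(scores)


# Toy usage: a 4x4 grid of key positions with content silenced (all zeros),
# so the attention pattern is driven by the spatial query alone.
positions = [((gx + 0.5) / 4, (gy + 0.5) / 4) for gy in range(4) for gx in range(4)]
spatial_keys = [sinusoidal_embedding(x, y, 8) for x, y in positions]
content_keys = [[0.0, 0.0] for _ in positions]
decoder_embedding = [1.0, 0.0]
weight = [[1.0, 0.0] for _ in range(8)]  # hypothetical weights: all-ones scale
p_q = conditional_spatial_query(decoder_embedding, (0.375, 0.625), weight)
weights = cross_attention_weights([0.0, 0.0], p_q, content_keys, spatial_keys)
peak = positions[weights.index(max(weights))]
print(peak)
```

In this toy setting the content term contributes nothing, so the sketch only shows that a spatial query can localize a region without help from the content embeddings. Changing `weight` or `decoder_embedding` rescales individual embedding dimensions and thereby reshapes the attended region, which is the role the abstract attributes to the conditional spatial query.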