Recent DEtection TRansformer (DETR)-based models have achieved remarkable performance. Their success depends on the re-introduction of multi-scale feature fusion in the encoder. However, multi-scale features greatly increase the number of tokens, about 75\% of which come from low-level features, and the resulting computational cost hinders real-world applications of DETR models. In this paper, we present Lite DETR, a simple yet efficient end-to-end object detection framework that reduces the GFLOPs of the detection head by 60\% while keeping 99\% of the original performance. Specifically, we design an efficient encoder block that updates high-level features (corresponding to small-resolution feature maps) and low-level features (corresponding to large-resolution feature maps) in an interleaved way. In addition, to better fuse cross-scale features, we develop a key-aware deformable attention that predicts more reliable attention weights. Comprehensive experiments validate the effectiveness and efficiency of the proposed Lite DETR, and the efficient encoder strategy generalizes well across existing DETR-based models. The code will be available at \url{https://github.com/IDEA-Research/Lite-DETR}.
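The figure of about 75\% for low-level tokens can be checked with simple arithmetic. The sketch below assumes a four-scale feature pyramid with strides 8, 16, 32, and 64 and an 800 x 1216 input; these values and the helper name `token_share` are illustrative assumptions, not taken from the abstract. Each scale has a quarter as many tokens as the one before it, so the largest-resolution map holds roughly 64/85 of all tokens.

```python
import math


def token_share(height, width, strides=(8, 16, 32, 64)):
    """Return per-scale token counts and each scale's share of the total.

    The strides are an assumed four-scale configuration, chosen only
    for illustration; other backbones may use different scales.
    """
    # One token per spatial position of each feature map.
    counts = [math.ceil(height / s) * math.ceil(width / s) for s in strides]
    total = sum(counts)
    shares = [c / total for c in counts]
    return counts, shares


# Assumed input resolution, used only as an example.
counts, shares = token_share(800, 1216)
for stride, count, share in zip((8, 16, 32, 64), counts, shares):
    print(f"stride {stride:>2}: {count:>6} tokens ({share:.1%})")
```

Under these assumptions the stride-8 map alone accounts for about three quarters of the encoder tokens, which is why updating the low-level features less often saves so much computation.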