In this paper, we present DAT, a Depth-Aware Transformer framework for camera-based 3D detection. DAT is motivated by two major issues we observe in existing methods: large translation errors along the depth direction and duplicate predictions along the depth axis. We propose one solution for each. To address the first issue, we introduce a Depth-Aware Spatial Cross-Attention (DA-SCA) module that incorporates depth information into spatial cross-attention when lifting image features to 3D space. To address the second issue, we introduce an auxiliary learning task, the Depth-aware Negative Suppression (DNS) loss. We first organize features into a Bird's-Eye-View (BEV) feature map according to their reference points. We then sample positive and negative features along each object ray, that is, the ray connecting an object and a camera, and train the model to distinguish between them. Together, DA-SCA and DNS effectively alleviate these two problems. We show that DAT is a versatile method that improves the performance of three popular models: BEVFormer, DETR3D, and PETR. On BEVFormer, DAT achieves a significant improvement of +2.8 NDS on nuScenes val under the same settings. Moreover, with a pre-trained VoVNet-99 backbone, DAT achieves strong results of 60.0 NDS and 51.5 mAP on nuScenes test. Our code will be released soon.
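The ray-based sampling behind the DNS loss can be illustrated with a minimal sketch. The abstract does not specify how positives and negatives are chosen, so everything below is an assumption made for illustration: the positive is taken at the object's own BEV location, negatives are taken at other depths on the same camera-to-object ray, and the names `ray_samples`, `to_bev_cell`, `num_neg`, `min_gap`, `max_depth`, `bev_range`, and `bev_size` are hypothetical rather than taken from the DAT code.

```python
import math


def ray_samples(camera_xy, object_xy, num_neg=4, min_gap=2.0, max_depth=50.0):
    """Return (positive, negatives) as BEV points on the camera-to-object ray.

    Assumed scheme (not from the paper): the positive sits at the object's
    depth; negatives sit at depths obj_depth -/+ k * min_gap, alternating
    in front of and behind the object, kept only inside (0, max_depth].
    """
    dx = object_xy[0] - camera_xy[0]
    dy = object_xy[1] - camera_xy[1]
    obj_depth = math.hypot(dx, dy)
    if obj_depth == 0.0:
        raise ValueError("object coincides with the camera; ray is undefined")
    ux, uy = dx / obj_depth, dy / obj_depth  # unit direction of the object ray

    def point_at(depth):
        return (camera_xy[0] + depth * ux, camera_xy[1] + depth * uy)

    positive = point_at(obj_depth)
    negatives = []
    k = 1
    while len(negatives) < num_neg:
        front = obj_depth - k * min_gap
        behind = obj_depth + k * min_gap
        if front <= 0.0 and behind > max_depth:
            break  # no valid depths remain on this ray
        for depth in (front, behind):
            if 0.0 < depth <= max_depth and len(negatives) < num_neg:
                negatives.append(point_at(depth))
        k += 1
    return positive, negatives


def to_bev_cell(xy, bev_range=51.2, bev_size=200):
    """Map a metric (x, y) point to a (row, col) cell of a square BEV grid.

    The grid is assumed to cover [-bev_range, bev_range) on both axes;
    points outside the grid return None.
    """
    cell = 2.0 * bev_range / bev_size  # metres per cell
    col = int((xy[0] + bev_range) // cell)
    row = int((xy[1] + bev_range) // cell)
    if 0 <= row < bev_size and 0 <= col < bev_size:
        return (row, col)
    return None
```

In a training loop, the BEV features at the returned cells would be fed to a small classification head with label 1 for the positive and 0 for the negatives, so that the model learns to suppress responses at wrong depths along the ray. That head and its loss are omitted here because the abstract does not describe them.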