In this paper, we propose a new operator, called 3D DeFormable Attention (DFA3D), for 2D-to-3D feature lifting, which transforms multi-view 2D image features into a unified 3D space for 3D object detection. Existing feature lifting approaches fall into two families. Lift-Splat-based methods use estimated depth to obtain pseudo-LiDAR features and then splat them into 3D space, a one-pass operation without feature refinement. 2D attention-based methods ignore depth and lift features through 2D attention mechanisms, which achieves finer semantics but suffers from a depth ambiguity problem. In contrast, our DFA3D-based method first leverages the estimated depth to expand each view's 2D feature map to 3D and then uses DFA3D to aggregate features from the expanded 3D feature maps. With DFA3D, the depth ambiguity problem is alleviated at its root, and the lifted features are progressively refined layer by layer thanks to the Transformer-like architecture. In addition, we propose a mathematically equivalent implementation of DFA3D that significantly improves its memory efficiency and computational speed. We integrate DFA3D into several methods that use 2D attention-based feature lifting, with only a few code modifications, and evaluate them on the nuScenes dataset. The experimental results show a consistent improvement of +1.41% mAP on average, and up to +15.1% mAP when high-quality depth information is available, demonstrating the superiority, applicability, and potential of DFA3D. The code is available at https://github.com/IDEA-Research/3D-deformable-attention.git.
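To make the idea of depth-based expansion and a "mathematically equivalent implementation" concrete, the following is a minimal NumPy sketch, not the released DFA3D code. It assumes that the 3D expansion is the outer product of a per-pixel depth distribution with the 2D feature vector, and that features are read out at a single 3D sampling point by trilinear interpolation; the abstract does not spell out either detail. The function names (`expand_to_3d`, `sample_trilinear`, `sample_factored`) are hypothetical, and attention weights, multiple heads, and learned offsets are omitted.

```python
import numpy as np

def expand_to_3d(feat, depth):
    """Assumed expansion: outer product of depth distribution and features.

    feat:  (H, W, C) 2D feature map of one view
    depth: (H, W, D) per-pixel distribution over D depth bins
    returns a (H, W, D, C) expanded 3D feature map
    """
    return depth[..., :, None] * feat[..., None, :]

def _corners(coord, size):
    """Two neighbouring grid indices and their linear weights.

    Assumes size >= 2 and 0 <= coord <= size - 1.
    """
    lo = int(np.clip(np.floor(coord), 0, size - 2))
    frac = coord - lo
    return (lo, 1.0 - frac), (lo + 1, frac)

def sample_trilinear(vol, x, y, z):
    """Trilinear sampling from the materialised (H, W, D, C) volume."""
    H, W, D, C = vol.shape
    out = np.zeros(C)
    for yi, wy in _corners(y, H):
        for xi, wx in _corners(x, W):
            for zi, wz in _corners(z, D):
                out += wy * wx * wz * vol[yi, xi, zi]
    return out

def sample_factored(feat, depth, x, y, z):
    """Same result without ever building the (H, W, D, C) volume.

    At each of the four 2D corners, the depth distribution is linearly
    interpolated along z, giving a scalar that re-weights the 2D feature.
    """
    H, W, C = feat.shape
    D = depth.shape[-1]
    out = np.zeros(C)
    for yi, wy in _corners(y, H):
        for xi, wx in _corners(x, W):
            d = sum(wz * depth[yi, xi, zi] for zi, wz in _corners(z, D))
            out += wy * wx * d * feat[yi, xi]
    return out
```

Under these assumptions the two sampling routines agree because the expanded volume is separable in depth and channel, so the depth interpolation can be pulled out of the channel sum. The factored form only touches the (H, W, C) and (H, W, D) tensors, which illustrates how an equivalent formulation can reduce memory use compared with materialising the full 3D feature map.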