We present Multi-View Attentive Contextualization (MvACon), a simple yet effective method for improving 2D-to-3D feature lifting in query-based multi-view 3D (MV3D) object detection. Despite remarkable progress in query-based MV3D object detection, prior art often either fails to exploit high-resolution 2D features in dense attention-based lifting, because of the high computational cost, or grounds 3D queries too sparsely in multi-scale 2D features in sparse attention-based lifting. MvACon addresses both problems at once with a representationally dense yet computationally sparse attentive feature contextualization scheme that is agnostic to the specific 2D-to-3D feature lifting approach. In experiments, MvACon is thoroughly tested on the nuScenes benchmark using BEVFormer, its recent 3D deformable attention (DFA3D) variant, and PETR, and shows consistent gains in detection performance, especially in location, orientation, and velocity prediction. It is also tested on the Waymo-mini benchmark using BEVFormer, with similar improvement. We show qualitatively and quantitatively that global cluster-based contexts effectively encode dense scene-level contexts for MV3D object detection. The promising results of MvACon reinforce an adage in computer vision: ``(contextualized) feature matters''.
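To make the phrase "representationally dense yet computationally sparse" concrete, the following is a minimal NumPy sketch of one way cluster-based contextualization could work; it is not the authors' implementation. It assumes that the N flattened 2D feature tokens are softly pooled into M global cluster contexts (with M much smaller than N) and that every token then attends to those M contexts, so the attention cost scales with N·M rather than N². The function name `cluster_contextualize`, the weight matrices, and the soft-assignment pooling are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cluster_contextualize(feats, w_assign, w_q, w_k, w_v):
    """Hypothetical sketch: contextualize (N, C) features via M cluster contexts.

    feats:    (N, C) flattened 2D image features
    w_assign: (C, M) projection producing token-to-cluster logits (assumed)
    w_q, w_k, w_v: (C, C) query/key/value projections (assumed)
    """
    # Soft-assign tokens to clusters; normalizing over tokens (axis 0)
    # makes each cluster a weighted average of all tokens.
    assign = softmax(feats @ w_assign, axis=0)       # (N, M)
    clusters = assign.T @ feats                      # (M, C)

    # Every token attends to the M cluster contexts: cost O(N * M).
    q = feats @ w_q                                  # (N, C)
    k = clusters @ w_k                               # (M, C)
    v = clusters @ w_v                               # (M, C)
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=1)  # (N, M)

    # Residual connection keeps the original per-token features.
    return feats + attn @ v

rng = np.random.default_rng(0)
N, C, M = 600, 32, 16
feats = rng.standard_normal((N, C))
w_assign = rng.standard_normal((C, M)) * 0.1
w_q, w_k, w_v = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))

out = cluster_contextualize(feats, w_assign, w_q, w_k, w_v)
print(out.shape)  # (600, 32)
```

In this sketch every output token depends on every input token through the cluster contexts, which is the "dense" part, while the attention matrix is only N×M, which is the "sparse" part. The contextualized features could then be passed to any lifting module in place of the raw 2D features.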