VoxelFormer: Bird's-Eye-View Feature Generation based on Dual-view Attention for Multi-view 3D Object Detection

In recent years, transformer-based detectors have demonstrated remarkable performance in 2D visual perception tasks. However, their performance in multi-view 3D object detection remains inferior to that of state-of-the-art (SOTA) detectors based on convolutional neural networks. In this work, we investigate this issue from the perspective of bird's-eye-view (BEV) feature generation. Specifically, we examine the BEV feature generation method employed by the transformer-based SOTA, BEVFormer, and identify two limitations: (i) it generates attention weights only from the BEV, which precludes the use of lidar points for supervision, and (ii) it aggregates camera view features to the BEV through deformable sampling, which selects only a small subset of features and fails to exploit all available information. To overcome these limitations, we propose a novel BEV feature generation method, dual-view attention, which generates attention weights from both the BEV and the camera view. This method encodes all camera features into the BEV feature. By combining dual-view attention with the BEVFormer architecture, we build a new detector named VoxelFormer. Extensive experiments are conducted on the nuScenes benchmark to verify the superiority of dual-view attention and VoxelFormer. We observe that even when adopting only 3 encoders and 1 historical frame during training, VoxelFormer still outperforms BEVFormer significantly. When trained in the same setting, VoxelFormer surpasses BEVFormer by 4.9 percentage points in NDS. Code is available at: https://github.com/Lizhuoling/VoxelFormer-public.git.
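
To make the idea of combining attention weights from two views more concrete, the following is a minimal sketch and not the VoxelFormer implementation. It assumes that BEV-side weights come from a scaled dot product between each BEV query and every camera feature, that camera-side weights come from per-feature logits over BEV cells (the kind of quantity lidar could plausibly supervise), and that the two are multiplied and renormalized. The function name `dual_view_attention`, the argument `cam_logits`, and the multiplicative combination are all illustrative assumptions; the abstract does not specify these details.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def dual_view_attention(bev_queries, cam_feats, cam_logits):
    """Aggregate ALL camera features into each BEV cell (hypothetical sketch).

    bev_queries: Q vectors of dimension D, one per BEV cell.
    cam_feats:   N vectors of dimension D, every camera-view feature.
    cam_logits:  N x Q scores predicted in the camera view, one row per
                 camera feature, rating how strongly it belongs to each
                 BEV cell (an assumed input, not taken from the paper).
    """
    dim = len(cam_feats[0])
    scale = 1.0 / math.sqrt(dim)

    # Camera-view weights: each camera feature distributes itself over BEV cells.
    cam_weights = [softmax(row) for row in cam_logits]

    bev_features = []
    for q, query in enumerate(bev_queries):
        # BEV-view weights: each BEV cell attends over all camera features,
        # with no sparse sampling of a subset.
        bev_weights = softmax([dot(query, feat) * scale for feat in cam_feats])

        # Assumed combination rule: multiply the two views, then renormalize.
        combined = [bev_weights[i] * cam_weights[i][q] for i in range(len(cam_feats))]
        total = sum(combined)
        combined = [c / total for c in combined]

        bev_features.append(
            [sum(combined[i] * cam_feats[i][d] for i in range(len(cam_feats)))
             for d in range(dim)]
        )
    return bev_features


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]            # 2 BEV cells
    feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 3 camera features
    logits = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
    for cell in dual_view_attention(queries, feats, logits):
        print(cell)
```

Because every camera feature receives a nonzero weight, each BEV cell in this sketch is a convex combination of the full camera feature set, in contrast to a scheme that samples only a few locations per query.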
