The Bird's-Eye-View (BEV) representation is a critical factor that directly affects 3D object detection performance, but the computational cost of the traditional BEV grid representation grows quadratically with spatial resolution. To address this limitation, we present VectorFormer, a new camera-based 3D object detector built on a high-resolution vector representation. This vector representation is combined with a lower-resolution BEV representation through two novel modules, vector scattering and vector gathering, to efficiently exploit 3D geometry from multi-camera images at high resolution. The learned vector representation, which carries richer scene context, then serves as the decoding query for the final predictions. We conduct extensive experiments on the nuScenes dataset and demonstrate state-of-the-art performance in both NDS and inference time. Furthermore, we incorporate the proposed vector representation into existing query-BEV-based methods and observe consistent performance improvements.
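The scaling argument above can be illustrated by counting representation elements. This is only a sketch under assumptions not stated in the abstract: a square BEV grid with `r` cells per side, and a hypothetical vector representation made of one length-`r` vector per ground-plane axis. The function names are illustrative and do not reflect VectorFormer's actual implementation.

```python
def bev_grid_elements(r: int) -> int:
    """Number of cells in a square BEV grid with r cells per side."""
    return r * r


def axis_vector_elements(r: int) -> int:
    """Elements in two length-r vectors, one per ground-plane axis.

    Assumed layout for illustration only, not taken from the paper.
    """
    return 2 * r


if __name__ == "__main__":
    # Compare how both counts grow as the resolution doubles.
    for r in (100, 200, 400):
        print(r, bev_grid_elements(r), axis_vector_elements(r))
```

Under this assumed layout, doubling `r` quadruples the grid count but only doubles the vector count.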