The Bird's-eye View (BeV) representation is widely used for 3D perception from multi-view camera images. It makes it possible to merge features from different cameras into a common space, providing a unified representation of the 3D scene. The key component is the view transformer, which transforms image views into the BeV. However, current view transformer methods based on geometry or cross-attention do not provide a sufficiently detailed representation of the scene, because they rely on a sub-sampling of the 3D space that is suboptimal for modeling the fine structures of the environment. In this paper, we propose GaussianBeV, a novel method for transforming image features to BeV by finely representing the scene with a set of 3D gaussians located and oriented in 3D space. This representation is then splatted to produce the BeV feature map, adapting recent advances in 3D scene rendering based on gaussian splatting. GaussianBeV is the first approach to use this 3D gaussian modeling and rendering process online, i.e., without optimization on a specific scene, and directly integrated into a single-stage model for BeV scene understanding. Experiments show that the proposed representation is highly effective and places GaussianBeV as the new state of the art on the BeV semantic segmentation task on the nuScenes dataset.
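The core rendering step described above can be illustrated with a minimal sketch: a set of gaussians, each carrying a feature vector, is splatted additively onto a BeV grid by evaluating its 2D density (after dropping the height axis) at every cell center. This is a simplified illustration under assumed shapes and an assumed function name `splat_gaussians_to_bev`, not the authors' implementation, which uses differentiable gaussian splatting inside the network.

```python
import numpy as np

def splat_gaussians_to_bev(means, covs, feats, grid_size=(50, 50), cell=1.0):
    """Accumulate per-gaussian features into a BeV feature map.

    means: (N, 2) gaussian centers in BeV (metric) coordinates.
    covs:  (N, 2, 2) 2D covariance matrices (height axis marginalized out).
    feats: (N, C) feature vector carried by each gaussian.
    Returns an (H, W, C) BeV feature map.
    """
    H, W = grid_size
    C = feats.shape[1]
    bev = np.zeros((H, W, C))
    # Coordinates of every BeV cell center.
    ys, xs = np.meshgrid(np.arange(H) * cell, np.arange(W) * cell, indexing="ij")
    pts = np.stack([xs, ys], axis=-1)  # (H, W, 2)
    for mu, cov, f in zip(means, covs, feats):
        d = pts - mu                   # offset of each cell from the center
        inv = np.linalg.inv(cov)
        # Squared Mahalanobis distance -> gaussian weight per cell.
        m = np.einsum("hwi,ij,hwj->hw", d, inv, d)
        w = np.exp(-0.5 * m)
        bev += w[..., None] * f        # additive (alpha-free) splatting
    return bev
```

A narrow covariance concentrates a gaussian's feature in a few cells, which is what lets this representation capture fine structures that fixed voxel sub-sampling misses.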