3D semantic occupancy prediction aims to recover the fine-grained 3D geometry and semantics of the surrounding scene and is an important task for robust vision-centric autonomous driving. Most existing methods use dense grids such as voxels as the scene representation, which ignores the sparsity of occupancy and the diversity of object scales and thus leads to an unbalanced allocation of resources. To address this, we propose an object-centric representation that describes 3D scenes with sparse 3D semantic Gaussians, where each Gaussian represents a flexible region of interest and its semantic features. Our model, GaussianFormer, aggregates information from images through the attention mechanism and iteratively refines the properties of the 3D Gaussians, including position, covariance, and semantics. We then propose an efficient Gaussian-to-voxel splatting method to generate 3D occupancy predictions, which aggregates only the neighboring Gaussians for a given position. We conduct extensive experiments on the widely adopted nuScenes and KITTI-360 datasets. Experimental results demonstrate that GaussianFormer achieves performance comparable to state-of-the-art methods with only 17.8% to 24.8% of their memory consumption. Code is available at: https://github.com/huang-yh/GaussianFormer.
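To make the idea of Gaussian-to-voxel splatting concrete, the following is a minimal sketch and not the implementation in the released code. It assumes axis-aligned Gaussians (a diagonal covariance given by per-axis standard deviations), a cutoff of three standard deviations to define the neighborhood, and a sparse dictionary as the output grid. The names `Gaussian`, `splat_to_voxels`, and `radius_sigmas` are hypothetical and chosen only for illustration. Each Gaussian contributes its semantic vector, weighted by its density at the voxel center, only to voxels inside its neighborhood, so empty space costs nothing.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class Gaussian:
    """Hypothetical container for one semantic Gaussian (illustration only)."""
    mean: Tuple[float, float, float]   # center position in metres
    scale: Tuple[float, float, float]  # per-axis std dev (assumes diagonal covariance)
    semantics: List[float]             # one score per semantic class


def splat_to_voxels(
    gaussians: Sequence[Gaussian],
    grid_shape: Tuple[int, int, int],
    voxel_size: float,
    radius_sigmas: float = 3.0,        # assumed neighborhood cutoff
) -> Dict[Tuple[int, int, int], List[float]]:
    """Accumulate each Gaussian into the voxels near it and nowhere else."""
    occupancy: Dict[Tuple[int, int, int], List[float]] = {}
    if not gaussians:
        return occupancy
    num_classes = len(gaussians[0].semantics)

    for g in gaussians:
        # Index range of voxels that overlap the Gaussian's neighborhood per axis,
        # clamped to the grid bounds.
        ranges = []
        for axis, size in enumerate(grid_shape):
            reach = radius_sigmas * g.scale[axis]
            lo = math.floor((g.mean[axis] - reach) / voxel_size)
            hi = math.floor((g.mean[axis] + reach) / voxel_size)
            ranges.append(range(max(lo, 0), min(hi, size - 1) + 1))

        for i in ranges[0]:
            for j in ranges[1]:
                for k in ranges[2]:
                    center = tuple((idx + 0.5) * voxel_size for idx in (i, j, k))
                    # Squared Mahalanobis distance for a diagonal covariance.
                    d2 = sum(
                        ((c - m) / s) ** 2
                        for c, m, s in zip(center, g.mean, g.scale)
                    )
                    weight = math.exp(-0.5 * d2)
                    scores = occupancy.setdefault((i, j, k), [0.0] * num_classes)
                    for cls, value in enumerate(g.semantics):
                        scores[cls] += weight * value
    return occupancy
```

Taking the argmax over each voxel's scores would give a semantic label for that voxel; voxels absent from the dictionary received no contribution and can be treated as empty. The loop here runs over Gaussians rather than over voxels, which visits the same Gaussian-voxel pairs as collecting the neighboring Gaussians for each position.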