3D semantic occupancy prediction is a pivotal task in autonomous driving, providing a dense, fine-grained understanding of the surrounding environment, yet single-modality methods face a trade-off between camera semantics and LiDAR geometry. Existing multi-modal frameworks often struggle with modality heterogeneity, spatial misalignment, and a representation crisis: voxels are computationally heavy, while BEV alternatives are lossy. We present GaussianOcc3D, a multi-modal framework that bridges camera and LiDAR through a memory-efficient, continuous 3D Gaussian representation. We introduce four modules: (1) LiDAR Depth Feature Aggregation (LDFA), which uses depth-wise deformable sampling to lift sparse LiDAR signals onto Gaussian primitives; (2) Entropy-Based Feature Smoothing (EBFS), which mitigates cross-domain noise; (3) Adaptive Camera-LiDAR Fusion (ACLF), which applies uncertainty-aware reweighting to account for sensor reliability; and (4) a Gauss-Mamba head that leverages Selective State Space Models to capture global context with linear complexity. Evaluations on the Occ3D, SurroundOcc, and SemanticKITTI benchmarks demonstrate state-of-the-art performance, with mIoU scores of 49.4%, 28.9%, and 25.2%, respectively. GaussianOcc3D also exhibits superior robustness under challenging rainy and nighttime conditions.
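As a rough illustration of the uncertainty-aware reweighting idea behind ACLF, the sketch below weights each modality's per-Gaussian features by the inverse of its predicted uncertainty, so the more reliable sensor dominates the fused feature. This is a minimal NumPy sketch under assumed shapes and names (`fuse_features`, `(N, C)` features, `(N, 1)` uncertainties), not the paper's actual implementation.

```python
import numpy as np

def fuse_features(cam_feat, lidar_feat, cam_unc, lidar_unc, eps=1e-6):
    """Fuse camera and LiDAR features per Gaussian primitive by
    inverse-uncertainty reweighting (hypothetical sketch).

    cam_feat, lidar_feat: (N, C) feature arrays.
    cam_unc, lidar_unc:   (N, 1) non-negative uncertainty estimates.
    """
    # Lower uncertainty -> larger weight; eps avoids division by zero.
    w_cam = 1.0 / (cam_unc + eps)
    w_lidar = 1.0 / (lidar_unc + eps)
    # Normalize so the two weights sum to 1 for each Gaussian.
    total = w_cam + w_lidar
    w_cam, w_lidar = w_cam / total, w_lidar / total
    return w_cam * cam_feat + w_lidar * lidar_feat

# Toy example: 2 Gaussians with 3-dim features.
cam = np.ones((2, 3))     # camera branch outputs all ones
lid = np.zeros((2, 3))    # LiDAR branch outputs all zeros
# Gaussian 0: camera is confident; Gaussian 1: LiDAR is confident.
cam_u = np.array([[0.1], [10.0]])
lid_u = np.array([[10.0], [0.1]])
fused = fuse_features(cam, lid, cam_u, lid_u)
# fused[0] is close to the camera feature, fused[1] to the LiDAR feature.
```

In practice such weights would be predicted by a small network and the features would live on the shared 3D Gaussian primitives, but the normalization and reweighting logic is the same.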