OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving

Existing solutions for 3D semantic occupancy prediction typically treat the task as a one-shot, voxel-wise 3D segmentation problem. These discriminative methods learn the mapping from the inputs to the occupancy map in a single step, so they cannot refine the occupancy map gradually and lack the capacity to plausibly imagine the scene and complete local regions. In this paper, we introduce OccGen, a simple yet powerful generative perception model for 3D semantic occupancy prediction. OccGen adopts a "noise-to-occupancy" generative paradigm, progressively inferring and refining the occupancy map by predicting and eliminating noise drawn from a random Gaussian distribution. OccGen consists of two main components: a conditional encoder that processes multi-modal inputs, and a progressive refinement decoder that applies diffusion denoising with the multi-modal features as conditions. A key insight of this generative pipeline is that the diffusion denoising process naturally models the coarse-to-fine refinement of the dense 3D occupancy map and therefore produces more detailed predictions. Extensive experiments on several occupancy benchmarks demonstrate the effectiveness of the proposed method compared with state-of-the-art methods. For instance, OccGen improves mIoU by a relative 9.5%, 6.3%, and 13.3% on the nuScenes-Occupancy dataset under the multi-modal, LiDAR-only, and camera-only settings, respectively. Moreover, as a generative perception model, OccGen exhibits desirable properties that discriminative models cannot achieve, such as providing uncertainty estimates alongside its multi-step predictions.
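
To make the "noise-to-occupancy" idea concrete, the following is a minimal toy sketch, not the OccGen implementation. It assumes a cosine noise schedule, a deterministic DDIM-style update, and a decoder that predicts the clean map (from which the noise to remove is derived); none of these choices is stated in the abstract. The names `alpha_bar`, `toy_decoder`, `refine`, and `voxel_variance` are hypothetical, the occupancy map is flattened to a short list of per-voxel scores, and `toy_decoder` is a hand-written stand-in for a learned, feature-conditioned network.

```python
import math
import random


def alpha_bar(t, num_steps):
    """Assumed cosine schedule: 1.0 at t=0 (clean), about 0.0 at t=num_steps (noise)."""
    return math.cos(0.5 * math.pi * t / num_steps) ** 2


def toy_decoder(x_t, cond, t, num_steps):
    """Hypothetical stand-in for a learned refinement decoder.

    Returns a guess of the clean occupancy map: mostly the conditioning
    signal, with a small dependence on the current noisy map.
    """
    return [min(1.0, max(0.0, 0.8 * c + 0.2 * x)) for x, c in zip(x_t, cond)]


def refine(cond, num_steps=10, seed=0, decoder=toy_decoder):
    """Deterministic DDIM-style sampling from Gaussian noise to an occupancy map."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in cond]  # start from pure Gaussian noise
    for t in range(num_steps, 0, -1):
        a_t, a_prev = alpha_bar(t, num_steps), alpha_bar(t - 1, num_steps)
        x0_hat = decoder(x, cond, t, num_steps)
        # Noise implied by the current sample and the clean-map estimate.
        eps_hat = [(xi - math.sqrt(a_t) * x0) / math.sqrt(1.0 - a_t)
                   for xi, x0 in zip(x, x0_hat)]
        # Move the estimate to the lower noise level t-1.
        x = [math.sqrt(a_prev) * x0 + math.sqrt(1.0 - a_prev) * e
             for x0, e in zip(x0_hat, eps_hat)]
    return x


def voxel_variance(samples):
    """Per-voxel variance across sampled maps, a crude uncertainty proxy."""
    n = len(samples)
    columns = list(zip(*samples))
    means = [sum(col) / n for col in columns]
    return [sum((v - m) ** 2 for v in col) / n for col, m in zip(columns, means)]


cond = [0.9, 0.1, 0.5, 0.0]  # made-up per-voxel occupancy evidence
samples = [refine(cond, seed=s) for s in range(5)]
uncertainty = voxel_variance(samples)
```

Running `refine` with several seeds yields several plausible maps for the same conditioning input, and the spread across them is one simple way to read off a per-voxel uncertainty estimate. A single-step discriminative predictor returns one map and offers no such spread.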
