Semantic Scene Completion (SSC) aims to infer complete 3D geometry and semantics from monocular images, serving as a crucial capability for camera-based perception in autonomous driving. However, existing SSC methods relying on temporal stacking or depth projection often lack explicit motion reasoning and struggle with occlusions and noisy depth supervision. We propose CurriFlow, a novel semantic occupancy prediction framework that integrates optical flow-based temporal alignment with curriculum-guided depth fusion. CurriFlow employs a multi-level fusion strategy to align segmentation, visual, and depth features across frames using pre-trained optical flow, thereby improving temporal consistency and dynamic object understanding. To enhance geometric robustness, a curriculum learning mechanism progressively transitions from sparse yet accurate LiDAR depth to dense but noisy stereo depth during training, ensuring stable optimization and seamless adaptation to real-world deployment. Furthermore, semantic priors from the Segment Anything Model (SAM) provide category-agnostic supervision, strengthening voxel-level semantic learning and spatial consistency. Experiments on the SemanticKITTI benchmark demonstrate that CurriFlow achieves state-of-the-art performance with a mean IoU of 16.9, validating the effectiveness of our motion-guided and curriculum-aware design for camera-based 3D semantic scene completion.