3D semantic occupancy prediction offers intuitive and efficient scene understanding and has attracted significant interest in autonomous driving perception. Existing approaches either rely on full supervision, which demands costly voxel-level annotations, or on self-supervision, which provides limited guidance and yields suboptimal performance. To address these challenges, we propose OccLE, a label-efficient 3D semantic occupancy prediction method that takes images and LiDAR as inputs and maintains high performance with limited voxel annotations. Our key idea is to decouple the semantic and geometric learning tasks and then fuse the feature grids learned by both tasks for the final semantic occupancy prediction. Specifically, the semantic branch distills a 2D foundation model to provide aligned pseudo labels for 2D and 3D semantic learning. The geometric branch integrates image and LiDAR inputs through cross-plane synergy that exploits their inherent characteristics, employing semi-supervision to enhance geometry learning. We fuse the semantic and geometric feature grids with a Dual Mamba module and incorporate a scatter-accumulated projection to supervise predictions for unannotated voxels with the aligned pseudo labels. Experiments show that OccLE achieves competitive performance with only 10\% of voxel annotations on the SemanticKITTI and Occ3D-nuScenes datasets. The code will be publicly released at https://github.com/NerdFNY/OccLE.
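The scatter-accumulated projection can be illustrated with a minimal sketch: per-voxel semantic logits are scattered to the image-plane pixel each voxel projects to and accumulated per pixel, so that 2D pseudo labels can supervise voxels lacking ground-truth annotation. The function and variable names below are illustrative assumptions, not taken from the OccLE codebase.

```python
import numpy as np

def scatter_accumulate(voxel_logits, pixel_idx, num_pixels):
    """Scatter-add per-voxel logits into their projected pixel bins.

    voxel_logits: (N, C) semantic logits for N voxels (illustrative).
    pixel_idx:    (N,) flattened pixel index each voxel projects to,
                  e.g. from a camera projection of voxel centers.
    num_pixels:   total number of image pixels (H * W).
    Returns a (num_pixels, C) array of accumulated logits, which could
    then be compared against 2D pseudo labels.
    """
    out = np.zeros((num_pixels, voxel_logits.shape[1]))
    # np.add.at performs an unbuffered scatter-add, so multiple voxels
    # landing on the same pixel are all accumulated, not overwritten.
    np.add.at(out, pixel_idx, voxel_logits)
    return out

# Toy usage: 4 voxels with 3 semantic classes, projected onto 2 pixels.
logits = np.array([[1., 0., 0.],
                   [0., 1., 0.],
                   [0., 0., 1.],
                   [1., 0., 0.]])
idx = np.array([0, 0, 1, 1])  # two voxels land on each pixel
proj = scatter_accumulate(logits, idx, num_pixels=2)
# proj[0] accumulates classes {0, 1}; proj[1] accumulates classes {0, 2}.
```

In practice the projection would use real camera intrinsics/extrinsics and the accumulated logits would feed a 2D loss against the foundation-model pseudo labels; this sketch only shows the scatter-accumulation step.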