Efficient data utilization is crucial for advancing 3D scene understanding in autonomous driving, where the reliance on heavily human-annotated LiDAR point clouds challenges fully supervised methods. To address this, our study extends semi-supervised learning to LiDAR semantic segmentation, leveraging the intrinsic spatial priors of driving scenes and complementary multi-sensor cues to better exploit unlabeled data. We introduce LaserMix++, an evolved framework that mixes laser beams from different LiDAR scans and incorporates LiDAR-camera correspondences to further assist data-efficient learning. Our framework strengthens 3D scene consistency regularization with multi-modal cues, comprising 1) a multi-modal LaserMix operation for fine-grained cross-sensor interactions; 2) camera-to-LiDAR feature distillation that enhances LiDAR feature learning; and 3) language-driven knowledge guidance that generates auxiliary supervision using open-vocabulary models. The versatility of LaserMix++ enables applications across LiDAR representations, establishing it as a universally applicable solution. Our framework is rigorously validated through theoretical analysis and extensive experiments on popular driving perception datasets. Results demonstrate that LaserMix++ markedly outperforms fully supervised alternatives, achieving comparable accuracy with five times fewer annotations and significantly improving supervised-only baselines. This substantial advancement underscores the potential of semi-supervised approaches in reducing the reliance on extensive labeled data in LiDAR-based 3D scene understanding systems.
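To make the core LaserMix operation concrete: it partitions two LiDAR scans into non-overlapping inclination-angle areas and swaps alternating areas between the scans. The sketch below is a minimal illustration of this idea, not the paper's implementation; the function name `lasermix`, the number of areas, and the pitch range (roughly matching a typical automotive LiDAR) are assumptions for demonstration.

```python
import numpy as np

def lasermix(points_a, labels_a, points_b, labels_b,
             num_areas=6, pitch_min=-25.0, pitch_max=3.0):
    """Mix two LiDAR scans by swapping alternating inclination areas.

    points_*: (N, 3) arrays of xyz coordinates; labels_*: (N,) arrays.
    Hypothetical parameter defaults; the actual framework may differ.
    """
    def pitch_deg(pts):
        # inclination angle of each point w.r.t. the sensor origin
        return np.degrees(np.arctan2(pts[:, 2],
                                     np.linalg.norm(pts[:, :2], axis=1)))

    # split the vertical field of view into equal inclination areas
    edges = np.linspace(pitch_min, pitch_max, num_areas + 1)
    area_a = np.clip(np.digitize(pitch_deg(points_a), edges) - 1,
                     0, num_areas - 1)
    area_b = np.clip(np.digitize(pitch_deg(points_b), edges) - 1,
                     0, num_areas - 1)

    # take even-indexed areas from scan A and odd-indexed areas from scan B
    keep_a = area_a % 2 == 0
    keep_b = area_b % 2 == 1
    mixed_points = np.concatenate([points_a[keep_a], points_b[keep_b]])
    mixed_labels = np.concatenate([labels_a[keep_a], labels_b[keep_b]])
    return mixed_points, mixed_labels
```

In the semi-supervised setting, such mixed scans (with pseudo-labels on the unlabeled portion) feed a consistency objective between teacher and student predictions.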