Cross-modal knowledge transfer enhances point cloud representation learning in LiDAR semantic segmentation. Despite its potential, it suffers from the \textit{weak teacher challenge}, which arises because vehicle camera images are repetitive and lack diversity and because ground truth labels are sparse and inaccurate. To address this, we propose the Efficient Image-to-LiDAR Knowledge Transfer (ELiTe) paradigm. ELiTe introduces Patch-to-Point Multi-Stage Knowledge Distillation, which transfers comprehensive knowledge from a Vision Foundation Model (VFM), extensively trained on diverse open-world images, to a lightweight student model across modalities. ELiTe employs Parameter-Efficient Fine-Tuning to strengthen the VFM teacher and to expedite large-scale model training at minimal cost. Additionally, we introduce a Segment Anything Model based Pseudo-Label Generation approach to enhance low-quality image labels, facilitating robust semantic representations. Efficient knowledge transfer in ELiTe yields state-of-the-art results on the SemanticKITTI benchmark, outperforming real-time inference models. Our approach achieves this with significantly fewer parameters, confirming its effectiveness and efficiency.