Recent advances in vision foundation models (VFMs) have revolutionized visual perception in 2D, yet their potential for 3D scene understanding, particularly in autonomous driving applications, remains underexplored. In this paper, we introduce LargeAD, a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets. Our framework leverages VFMs to extract semantically rich superpixels from 2D images and aligns them with LiDAR point clouds to generate high-quality contrastive samples. This alignment facilitates cross-modal representation learning and enhances the semantic consistency between 2D and 3D data. Our framework incorporates four key innovations: (i) VFM-driven superpixel generation for detailed semantic representation, (ii) a VFM-assisted contrastive learning strategy to align multimodal features, (iii) superpoint temporal consistency to maintain stable representations over time, and (iv) multi-source data pretraining to generalize across various LiDAR configurations. Our approach achieves substantial gains over state-of-the-art methods in both linear probing and fine-tuning for LiDAR-based segmentation and object detection. Extensive experiments on 11 large-scale multi-sensor datasets confirm the superior performance of LargeAD, demonstrating its adaptability, efficiency, and robustness in real-world autonomous driving scenarios.
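To make the cross-modal alignment concrete, the sketch below illustrates one plausible form of the superpixel-to-superpoint contrastive objective described above: per-pixel 2D features are average-pooled over VFM superpixels, per-point 3D features are pooled over the paired LiDAR superpoints, and the two sets of embeddings are aligned with an InfoNCE loss. This is a minimal illustration under our own assumptions, not the authors' implementation; all function names, shapes, and the temperature value are illustrative.

```python
# Hypothetical sketch of VFM superpixel / LiDAR superpoint contrastive
# alignment (InfoNCE). Not the paper's code; names and shapes are assumed.
import torch
import torch.nn.functional as F

def pool_by_group(features: torch.Tensor, group_ids: torch.Tensor,
                  num_groups: int) -> torch.Tensor:
    """Average per-element features (N, C) into per-group embeddings (G, C)."""
    sums = torch.zeros(num_groups, features.size(1), device=features.device)
    sums.index_add_(0, group_ids, features)
    counts = torch.bincount(group_ids, minlength=num_groups).clamp(min=1)
    return sums / counts.unsqueeze(1)

def superpixel_superpoint_infonce(
    pixel_feats: torch.Tensor,     # (P, C) 2D features from the image branch
    superpixel_ids: torch.Tensor,  # (P,) VFM superpixel index per pixel
    point_feats: torch.Tensor,     # (Q, C) 3D features from the LiDAR backbone
    superpoint_ids: torch.Tensor,  # (Q,) superpoint index per point, paired
                                   #      one-to-one with the superpixels
    num_groups: int,
    temperature: float = 0.07,     # illustrative value
) -> torch.Tensor:
    """InfoNCE over paired (superpixel, superpoint) embeddings: the matching
    pair is the positive; all other superpixels in the batch are negatives."""
    z2d = F.normalize(pool_by_group(pixel_feats, superpixel_ids, num_groups), dim=1)
    z3d = F.normalize(pool_by_group(point_feats, superpoint_ids, num_groups), dim=1)
    logits = z3d @ z2d.t() / temperature  # (G, G) cosine-similarity logits
    targets = torch.arange(num_groups, device=logits.device)
    return F.cross_entropy(logits, targets)
```

In such a setup, the diagonal of the similarity matrix carries the positive pairs, so pulling 3D superpoint embeddings toward their corresponding 2D superpixel embeddings distills the VFM's semantic structure into the LiDAR backbone without any 3D labels.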