Recent advancements in vision foundation models (VFMs) have revolutionized visual perception in 2D, yet their potential for 3D scene understanding, particularly in autonomous driving applications, remains underexplored. In this paper, we introduce LargeAD, a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets. Our framework leverages VFMs to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples. This alignment facilitates cross-modal representation learning, enhancing the semantic consistency between 2D and 3D data. We introduce several key innovations: i) VFM-driven superpixel generation for detailed semantic representation, ii) a VFM-assisted contrastive learning strategy to align multimodal features, iii) superpoint temporal consistency to maintain stable representations across time, and iv) multi-source data pretraining to generalize across various LiDAR configurations. Our approach delivers significant performance improvements over state-of-the-art methods in both linear probing and fine-tuning for LiDAR-based segmentation and object detection. Extensive experiments on eleven large-scale multi-modal datasets highlight our superior performance, demonstrating the framework's adaptability, efficiency, and robustness in real-world autonomous driving scenarios.
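The VFM-assisted contrastive learning strategy described above can be illustrated with a minimal sketch. The snippet below is an assumption-laden simplification, not the paper's actual implementation: it pools per-pixel (or per-point) features into superpixel/superpoint embeddings and applies an InfoNCE-style loss in which each 2D superpixel embedding treats its matched 3D superpoint embedding as the positive and all other superpoints in the batch as negatives. The function names `pool_by_group` and `info_nce`, the NumPy formulation, and the temperature value are all hypothetical choices for illustration.

```python
import numpy as np

def pool_by_group(feats, group_ids):
    """Average features sharing a group id (superpixel/superpoint id).

    feats: (N, D) per-pixel or per-point features.
    group_ids: (N,) integer ids; returns (G, D) in sorted-id order.
    """
    ids = np.unique(group_ids)
    return np.stack([feats[group_ids == g].mean(axis=0) for g in ids])

def info_nce(img_emb, pts_emb, temperature=0.07):
    """Contrastive loss aligning 2D superpixel and 3D superpoint embeddings.

    img_emb, pts_emb: (G, D), row i of each is a matched 2D/3D pair.
    The i-th superpoint is the positive for the i-th superpixel; all
    other rows act as negatives (standard InfoNCE over the batch).
    """
    a = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    b = pts_emb / np.linalg.norm(pts_emb, axis=1, keepdims=True)
    logits = a @ b.T / temperature            # (G, G) cosine similarities
    # log-softmax over each row; the positive sits on the diagonal
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

In practice the matched pairs come from projecting LiDAR points into the image and grouping them by the VFM-generated superpixel they fall inside, so each superpixel embedding and its pooled superpoint embedding describe the same physical region. Perfectly aligned pairs drive the loss toward zero, while mismatched pairs keep it high.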