Recent advances in vision foundation models (VFMs) have opened up new possibilities for versatile and efficient visual perception. In this work, we introduce Seal, a novel framework that harnesses VFMs for segmenting diverse automotive point cloud sequences. Seal exhibits three appealing properties: i) Scalability: VFMs are distilled directly into point clouds, eliminating the need for 2D or 3D annotations during pretraining. ii) Consistency: Spatial and temporal relationships are enforced at both the camera-to-LiDAR and point-to-segment stages, facilitating cross-modal representation learning. iii) Generalizability: Seal transfers knowledge in an off-the-shelf manner to downstream tasks involving diverse point clouds, including those from real/synthetic, low/high-resolution, large/small-scale, and clean/corrupted datasets. Extensive experiments on eleven different point cloud datasets demonstrate the effectiveness and superiority of Seal. Notably, Seal achieves 45.0% mIoU on nuScenes after linear probing, surpassing random initialization by 36.9% mIoU and outperforming the prior art by 6.1% mIoU. Moreover, Seal shows significant performance gains over existing methods across 20 different few-shot fine-tuning tasks on all eleven tested point cloud datasets.
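The abstract reports its headline results as mIoU (mean intersection-over-union). As a minimal sketch of how that metric is commonly computed, the snippet below derives mIoU from a confusion matrix and then works out the baselines implied by the linear-probing figures. It rests on several assumptions: the function name `mean_iou` and the toy confusion matrix are hypothetical and are not taken from Seal's code or data, the exact evaluation protocol (for example, which classes are ignored) is not stated in the abstract, and the reported gains are read as absolute percentage points.

```python
def mean_iou(confusion):
    """Mean intersection-over-union from a square confusion matrix.

    confusion[i][j] counts points whose ground-truth class is i and whose
    predicted class is j. Classes absent from both the ground truth and
    the predictions are skipped so they do not distort the mean.
    """
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        true_pos = confusion[c][c]
        false_neg = sum(confusion[c]) - true_pos
        false_pos = sum(row[c] for row in confusion) - true_pos
        union = true_pos + false_pos + false_neg
        if union > 0:
            ious.append(true_pos / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy 3-class example (made-up counts, NOT data from the paper).
toy_confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 2, 8],
]
toy_miou = mean_iou(toy_confusion)

# Baselines implied by the reported nuScenes linear-probing numbers,
# assuming the stated gains are absolute percentage points.
seal_miou = 45.0
random_init_miou = round(seal_miou - 36.9, 1)  # implied random-init score
prior_best_miou = round(seal_miou - 6.1, 1)    # implied best prior score

print(f"toy mIoU: {toy_miou:.4f}")
print(f"implied random-init mIoU: {random_init_miou}%")
print(f"implied prior-art mIoU: {prior_best_miou}%")
```

Under that reading, random initialization would sit at 8.1% mIoU and the strongest prior method at 38.9% mIoU in the same linear-probing setting.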