Recently, large-scale pre-trained models such as the Segment-Anything Model (SAM) and Contrastive Language-Image Pre-training (CLIP) have demonstrated remarkable success and revolutionized the field of computer vision. These vision foundation models capture knowledge from large-scale, broad data with their vast number of parameters, enabling them to perform zero-shot segmentation on previously unseen data without additional training. While they are competent at 2D tasks, their potential for enhancing 3D scene understanding remains relatively unexplored. To this end, we present a novel framework that adapts various foundation models to the 3D point cloud segmentation task. Our approach first predicts 2D semantic masks using different large vision models. We then project these mask predictions from multiple frames of RGB-D video sequences into 3D space. To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that combines all the projected results via voting. We examine diverse scenarios, such as zero-shot learning and limited guidance from sparse 2D point labels, to assess the pros and cons of different vision foundation models. We evaluate our approach on the ScanNet dataset of 3D indoor scenes, and the results demonstrate the effectiveness of adopting general 2D foundation models for 3D point cloud segmentation.
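The projection and voting steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how projected pixels are associated with 3D points, so this sketch assumes a simple voxel grid for that purpose. The function names (`backproject`, `fuse_labels`), the pinhole intrinsics `fx, fy, cx, cy`, the 4x4 camera-to-world `pose`, and the `voxel` size are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def backproject(u, v, depth, fx, fy, cx, cy, pose):
    """Lift pixel (u, v) with metric depth into world coordinates.

    Assumes a pinhole camera and a 4x4 row-major camera-to-world pose
    (both are assumptions of this sketch).
    """
    # Pixel -> camera coordinates.
    xc = (u - cx) * depth / fx
    yc = (v - cy) * depth / fy
    zc = depth
    # Camera -> world coordinates: p_world = R @ p_cam + t.
    return tuple(
        pose[i][0] * xc + pose[i][1] * yc + pose[i][2] * zc + pose[i][3]
        for i in range(3)
    )


def fuse_labels(frames, fx, fy, cx, cy, voxel=0.05):
    """Fuse per-frame 2D semantic masks into 3D pseudo labels by voting.

    frames: iterable of (depth_map, label_map, pose), where depth_map and
    label_map are 2D lists of equal shape. A label of None or a
    non-positive depth means "no prediction" and is skipped.
    Returns {voxel_index: majority_label}.
    """
    votes = defaultdict(Counter)
    for depth_map, label_map, pose in frames:
        for v, (depth_row, label_row) in enumerate(zip(depth_map, label_map)):
            for u, (d, label) in enumerate(zip(depth_row, label_row)):
                if d <= 0 or label is None:
                    continue
                x, y, z = backproject(u, v, d, fx, fy, cx, cy, pose)
                # Hypothetical association step: quantize to a voxel grid so
                # that observations from different frames share a key.
                key = (round(x / voxel), round(y / voxel), round(z / voxel))
                votes[key][label] += 1
    # Majority vote per voxel.
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}
```

Each frame contributes one vote per valid pixel, so a label that a 2D model predicts inconsistently across views is outvoted by the label seen in most views. A real pipeline would likely vote on the points of the reconstructed scene rather than on voxels and might weight votes by model confidence; those choices are left open here.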