Vision foundation models such as Contrastive Vision-Language Pre-training (CLIP) and Segment Anything (SAM) have demonstrated impressive zero-shot performance on image classification and segmentation tasks. However, combining CLIP and SAM for label-free scene understanding has yet to be explored. In this paper, we investigate the potential of vision foundation models to enable networks to comprehend 2D and 3D worlds without labelled data. The primary challenge lies in effectively supervising networks with extremely noisy pseudo labels, which are generated by CLIP and become noisier still when propagated from the 2D to the 3D domain. To tackle this challenge, we propose a novel Cross-modality Noisy Supervision (CNS) method that leverages the strengths of CLIP and SAM to supervise 2D and 3D networks simultaneously. In particular, we introduce a prediction consistency regularization to co-train the 2D and 3D networks, and then further enforce consistency between the networks' latent spaces using SAM's robust feature representation. Experiments conducted on diverse indoor and outdoor datasets demonstrate the superior performance of our method in understanding 2D and 3D open environments. Our 2D and 3D networks achieve label-free semantic segmentation with 28.4% and 33.5% mIoU on ScanNet, improvements of 4.7% and 7.9%, respectively. On the nuScenes dataset, our method achieves 26.8% mIoU, an improvement of 6%. Code will be released (https://github.com/runnanchen/Label-Free-Scene-Understanding).
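To make the idea of prediction consistency between the two modalities concrete, the following is a minimal sketch, not the formulation used in the paper. It assumes that each 3D point has already been paired with the pixel it projects to, and it penalizes disagreement between the paired class distributions with a symmetric KL divergence. The choice of symmetric KL, the function names, and the toy logits are all illustrative assumptions; the abstract does not specify the actual loss.

```python
import math


def softmax(logits):
    """Convert a list of class logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def prediction_consistency_loss(logits_2d, logits_3d):
    """Mean symmetric KL between paired pixel and point predictions.

    logits_2d[i] and logits_3d[i] are the class logits of the i-th
    pixel-point pair (hypothetical pairing, assumed to be given).
    """
    if len(logits_2d) != len(logits_3d):
        raise ValueError("pixel and point predictions must be paired one-to-one")
    if not logits_2d:
        return 0.0
    total = 0.0
    for pixel_logits, point_logits in zip(logits_2d, logits_3d):
        p2d = softmax(pixel_logits)
        p3d = softmax(point_logits)
        total += 0.5 * (kl_divergence(p2d, p3d) + kl_divergence(p3d, p2d))
    return total / len(logits_2d)


if __name__ == "__main__":
    # Toy example: two pixel-point pairs, three classes (made-up values).
    pixels = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
    points = [[1.5, 0.7, -0.5], [2.5, 0.0, 0.3]]
    print(prediction_consistency_loss(pixels, points))
```

The loss is zero when the two networks agree on every pair and grows as their predictions diverge, so minimizing it pulls the 2D and 3D outputs toward each other. A latent-space consistency term could be sketched in the same way by comparing feature vectors instead of class distributions.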