Open-vocabulary segmentation of 3D scenes is a fundamental function of human perception and thus a crucial objective in computer vision research. However, this task is heavily impeded by the lack of large-scale and diverse 3D open-vocabulary segmentation datasets for training robust and generalizable models. Distilling knowledge from pre-trained 2D open-vocabulary segmentation models helps, but it significantly compromises the open-vocabulary property, because the 2D models are mostly fine-tuned on closed-vocabulary datasets. We tackle the challenges in 3D open-vocabulary segmentation by exploiting the open-vocabulary multimodal knowledge and object reasoning capability of the pre-trained foundation models CLIP and DINO, without any fine-tuning. Specifically, we distill open-vocabulary visual and textual knowledge from CLIP into a neural radiance field (NeRF), which effectively lifts 2D features into view-consistent 3D segmentation. Furthermore, we introduce a Relevancy-Distribution Alignment loss and a Feature-Distribution Alignment loss, which respectively mitigate the ambiguities of CLIP features and distill precise object boundaries from DINO features, eliminating the need for segmentation annotations during training. Extensive experiments show that our method even outperforms fully supervised models trained with segmentation annotations, suggesting that 3D open-vocabulary segmentation can be effectively learned from 2D images and text-image pairs.
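As a rough illustration of aligning a rendered segmentation with CLIP-derived relevancy, the sketch below computes a per-pixel distribution over class names from feature-text similarity and compares it with a predicted segmentation distribution. The abstract does not specify the form of either loss, so everything here is an assumption made for illustration: the softmax over cosine similarities, the `temperature` value, the Jensen-Shannon divergence, and all function names. Random arrays stand in for CLIP features and for NeRF-rendered class logits.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relevancy_distribution(pixel_feats, text_feats, temperature=0.1):
    """Per-pixel distribution over classes from cosine similarity.

    pixel_feats: (N, D) image-side features, one row per pixel (stand-in for CLIP).
    text_feats:  (C, D) text embeddings, one row per class name (stand-in for CLIP).
    Returns an (N, C) array whose rows sum to 1.
    """
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    similarity = p @ t.T  # cosine similarities, shape (N, C)
    return softmax(similarity / temperature, axis=-1)

def js_divergence(p, q, eps=1e-8):
    """Row-wise Jensen-Shannon divergence between two (N, C) distributions."""
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * (np.log(p + eps) - np.log(m + eps)), axis=-1)
    kl_qm = np.sum(q * (np.log(q + eps) - np.log(m + eps)), axis=-1)
    return 0.5 * (kl_pm + kl_qm)

# Hypothetical toy data: 6 pixels, 3 class names, 16-dimensional features.
rng = np.random.default_rng(0)
pixel_feats = rng.normal(size=(6, 16))
text_feats = rng.normal(size=(3, 16))

target = relevancy_distribution(pixel_feats, text_feats)  # relevancy-style target
pred = softmax(rng.normal(size=(6, 3)))                   # stand-in for rendered segmentation
loss = js_divergence(pred, target).mean()                 # scalar alignment loss
```

In an actual training loop, `pred` would come from volume-rendering the segmentation field along camera rays, and the loss would be minimized with an automatic-differentiation framework rather than evaluated in NumPy.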