The lack of a large-scale 3D-text corpus has led recent works to distill open-vocabulary knowledge from vision-language models (VLMs). However, these methods typically rely on a single VLM to align the feature space of 3D models within a common language space, which limits the potential of 3D models to leverage the diverse spatial and semantic capabilities encapsulated in various foundation models. In this paper, we propose Cross-modal and Uncertainty-aware Agglomeration for Open-vocabulary 3D Scene Understanding, dubbed CUA-O3D, the first model to integrate multiple foundation models, such as CLIP, DINOv2, and Stable Diffusion, into 3D scene understanding. We further introduce a deterministic uncertainty estimation scheme to adaptively distill and harmonize the heterogeneous 2D feature embeddings from these models. Our method addresses two key challenges: (1) incorporating semantic priors from VLMs alongside the geometric knowledge of spatially aware vision foundation models, and (2) using a novel deterministic uncertainty estimation to capture model-specific uncertainties across diverse semantic and geometric sensitivities, which helps reconcile heterogeneous representations during training. Extensive experiments on ScanNetV2 and Matterport3D demonstrate that our method not only advances open-vocabulary segmentation but also achieves robust cross-domain alignment and competitive spatial perception. The code will be available at: https://github.com/TyroneLi/CUA_O3D.
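To make the distillation idea concrete, the following is a minimal PyTorch sketch of uncertainty-weighted multi-teacher feature distillation: per-point 3D features are projected into each 2D teacher's embedding space (e.g., CLIP, DINOv2, Stable Diffusion), and a per-teacher log-variance head down-weights the regression loss where that teacher's features are unreliable, in the spirit of Kendall-and-Gal-style heteroscedastic weighting. All names (MultiTeacherDistiller, teacher_dims, log_var_heads) and the loss form are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch (assumed, not CUA-O3D's exact method): distill several 2D teacher
# embeddings into per-point 3D features, weighting each teacher's loss by a
# predicted per-point uncertainty (heteroscedastic, Kendall & Gal style).
import torch
import torch.nn as nn

class MultiTeacherDistiller(nn.Module):
    def __init__(self, point_feat_dim: int, teacher_dims: dict):
        super().__init__()
        # One linear projection head per 2D teacher (feature dimensions differ per model).
        self.heads = nn.ModuleDict(
            {name: nn.Linear(point_feat_dim, dim) for name, dim in teacher_dims.items()}
        )
        # One log-variance head per teacher: predicts a per-point uncertainty that
        # attenuates the distillation loss where the teacher's features are noisy.
        self.log_var_heads = nn.ModuleDict(
            {name: nn.Linear(point_feat_dim, 1) for name in teacher_dims}
        )

    def forward(self, point_feats: torch.Tensor, teacher_feats: dict) -> torch.Tensor:
        # point_feats: (N, point_feat_dim) per-point features from a 3D backbone.
        # teacher_feats: {teacher_name: (N, teacher_dim)} 2D features lifted to points.
        total_loss = point_feats.new_zeros(())
        for name, target in teacher_feats.items():
            pred = self.heads[name](point_feats)                        # (N, teacher_dim)
            log_var = self.log_var_heads[name](point_feats)             # (N, 1)
            sq_err = (pred - target).pow(2).mean(dim=-1, keepdim=True)  # (N, 1)
            # exp(-log_var) down-weights points with high predicted uncertainty;
            # the + log_var term penalizes predicting unbounded uncertainty.
            total_loss = total_loss + (torch.exp(-log_var) * sq_err + log_var).mean()
        return total_loss

# Usage with random tensors standing in for real lifted features:
model = MultiTeacherDistiller(96, {"clip": 512, "dinov2": 768, "sd": 1280})
pts = torch.randn(1024, 96)
teachers = {"clip": torch.randn(1024, 512),
            "dinov2": torch.randn(1024, 768),
            "sd": torch.randn(1024, 1280)}
loss = model(pts, teachers)
loss.backward()
```

The per-teacher log-variance acts as an adaptive, point-wise weight, so teachers with conflicting or noisy supervision at a given point contribute less to the gradient, which is one plausible way to harmonize heterogeneous 2D embeddings during training.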