Existing 3D open-vocabulary scene understanding methods mostly focus on distilling language features from 2D foundation models into 3D feature fields, but largely overlook the synergy among scene appearance, semantics, and geometry. As a result, scene understanding often deviates from the underlying geometric structure of the scene and becomes decoupled from the reconstruction process. In this work, we propose a novel approach that leverages language- and geometry-grounded sparse voxel representations to comprehensively model appearance, semantics, and geometry within a unified framework. Specifically, we use 3D sparse voxels as primitives and employ an appearance field, a density field, a feature field, and a confidence field to holistically represent a 3D scene. To promote synergy among the appearance, density, and feature fields, we construct a feature modulation module and distill language features from a 2D foundation model into our 3D scene model. In addition, we integrate geometric distillation into the feature field distillation, transferring geometric knowledge from a geometry foundation model to our 3D scene representation via depth correlation regularization and pattern consistency regularization. Together, these components synergistically model the appearance, semantics, and geometry of a 3D scene within a unified framework. Extensive experiments demonstrate that our approach achieves superior overall performance compared with state-of-the-art methods in holistic scene understanding and reconstruction.
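The representation and the geometric distillation described above can be sketched in code. This is a minimal, hypothetical illustration only: the field dimensions, the sigmoid-gated modulation, and the Pearson-correlation form of the depth regularizer are our assumptions for exposition, not the paper's exact design. The class name `SparseVoxelScene` and both function names are invented for this sketch.

```python
import numpy as np

class SparseVoxelScene:
    """Hypothetical sketch of a sparse-voxel scene model holding four
    per-voxel fields: appearance, density, language feature, confidence."""

    def __init__(self, num_voxels, app_dim=12, feat_dim=512, seed=0):
        rng = np.random.default_rng(seed)
        self.appearance = rng.standard_normal((num_voxels, app_dim))
        self.density = np.zeros((num_voxels, 1))
        self.feature = rng.standard_normal((num_voxels, feat_dim))
        self.confidence = np.zeros((num_voxels, 1))

    def modulated_features(self):
        # Illustrative feature modulation: gate each voxel's language
        # feature by sigmoid(confidence) so unreliable voxels contribute
        # less when features are rendered and compared to the 2D teacher.
        gate = 1.0 / (1.0 + np.exp(-self.confidence))
        return self.feature * gate  # broadcasts (N,1) over (N,feat_dim)

def depth_correlation_loss(pred_depth, teacher_depth, eps=1e-8):
    """Sketch of depth correlation regularization: push rendered depth to
    correlate with a geometry foundation model's predicted depth. Pearson
    correlation is invariant to the teacher's unknown scale and shift."""
    p = pred_depth - pred_depth.mean()
    t = teacher_depth - teacher_depth.mean()
    corr = (p * t).sum() / (np.linalg.norm(p) * np.linalg.norm(t) + eps)
    return 1.0 - corr  # 0 when perfectly correlated
```

For example, a rendered depth map that is an affine transform of the teacher's depth (any scale and offset) incurs near-zero loss, which is the usual motivation for correlation-style depth losses when the monocular teacher's depth is only defined up to scale.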