Neural radiance fields (NeRF) have emerged as a promising approach for representing continuous 3D scenes. However, the lack of semantic encoding in NeRFs poses a significant challenge for scene decomposition. To address this challenge, we present a single model, Multi-Modal Decomposition NeRF ($M^2D$NeRF), that is capable of both text-based and visual patch-based edits. Specifically, we use multi-modal feature distillation to integrate teacher features from pretrained visual and language models into 3D semantic feature volumes, thereby facilitating consistent 3D editing. To enforce consistency between the visual and language features in our 3D feature volumes, we introduce a multi-modal similarity constraint. We further introduce a patch-based joint contrastive loss that encourages object regions to coalesce in the 3D feature space, yielding more precise boundaries. Experiments on various real-world scenes show superior performance in 3D scene decomposition tasks compared to prior NeRF-based methods.
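To make the training objectives concrete, the following is a minimal numpy sketch of the two ingredients described above: a per-ray feature distillation loss against frozen teacher features, and a cross-modal similarity constraint that aligns the pairwise-similarity structure of the visual and language feature volumes. All function names, loss forms, and feature dimensions here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Unit-normalize feature vectors so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def distillation_loss(student, teacher):
    # Hypothetical distillation term: mean-squared error between features
    # rendered from the 3D feature volume and the frozen 2D teacher features.
    return np.mean((student - teacher) ** 2)

def cross_modal_similarity_loss(f_vis, f_lang):
    # Hypothetical multi-modal similarity constraint: the pairwise cosine
    # similarity matrix of the visual features should match that of the
    # language features, even though the two live in different dimensions.
    sim_vis = l2_normalize(f_vis) @ l2_normalize(f_vis).T
    sim_lang = l2_normalize(f_lang) @ l2_normalize(f_lang).T
    return np.mean((sim_vis - sim_lang) ** 2)

# Toy batch of 64 rays; feature widths chosen to resemble common teacher
# models (e.g. a DINO-style visual teacher, a CLIP-style language teacher).
rng = np.random.default_rng(0)
f_vis = rng.normal(size=(64, 384))    # rendered visual features (student)
f_lang = rng.normal(size=(64, 512))   # rendered language features (student)
t_vis = rng.normal(size=(64, 384))    # visual teacher features
t_lang = rng.normal(size=(64, 512))   # language teacher features

total_loss = (distillation_loss(f_vis, t_vis)
              + distillation_loss(f_lang, t_lang)
              + cross_modal_similarity_loss(f_vis, f_lang))
```

In this sketch the similarity constraint compares relational structure (similarity matrices) rather than raw features, which sidesteps the mismatch between the visual and language feature dimensions; whether the paper uses this exact mechanism is an assumption.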