Neural Radiance Fields (NeRFs) have become a powerful tool for modeling 3D scenes from multiple images. However, NeRFs remain difficult to segment into semantically meaningful regions. Previous approaches to 3D segmentation of NeRFs either require user interaction to isolate a single object, or rely on 2D semantic masks with a limited number of classes for supervision. As a consequence, they generalize poorly to class-agnostic masks automatically generated in real scenes. This is attributable to the ambiguity of zero-shot segmentation, which yields inconsistent masks across views. In contrast, we propose a method that is robust to inconsistent segmentations and successfully decomposes the scene into a set of objects of any class. By introducing a limited number of competing object slots against which masks are matched, a meaningful object representation emerges that best explains the 2D supervision and minimizes an additional regularization term. Our experiments demonstrate the ability of our method to generate 3D panoptic segmentations on complex scenes, and to extract high-quality 3D assets from NeRFs that can then be used in virtual 3D environments.
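The slot-matching idea described above can be sketched as a Hungarian assignment between masks rendered from a fixed set of object slots and the (possibly inconsistent) 2D masks produced by a zero-shot segmenter for one view. This is a minimal illustrative sketch, not the paper's exact formulation: the function name, the IoU matching cost, and the use of `scipy`'s assignment solver are all assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_masks_to_slots(slot_masks, gt_masks):
    """Assign each 2D mask to one of a fixed set of object slots.

    slot_masks: (K, H, W) array of soft masks rendered from K object slots.
    gt_masks:   (M, H, W) array of binary 2D masks from a zero-shot
                segmenter for a single view (M <= K assumed here).
    Returns a list of (mask_index, slot_index) pairs.
    """
    M, K = len(gt_masks), len(slot_masks)
    iou = np.zeros((M, K))
    for m in range(M):
        for k in range(K):
            # Soft IoU between a 2D supervision mask and a rendered slot mask.
            inter = np.minimum(gt_masks[m], slot_masks[k]).sum()
            union = np.maximum(gt_masks[m], slot_masks[k]).sum()
            iou[m, k] = inter / max(union, 1e-8)
    # Hungarian matching: maximize total IoU by minimizing its negative.
    rows, cols = linear_sum_assignment(-iou)
    return list(zip(rows.tolist(), cols.tolist()))
```

Because the slots compete for masks across all training views, a slot that consistently wins the same physical object accumulates coherent supervision even when individual per-view masks disagree.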