Understanding complex scenes at multiple levels of abstraction remains a formidable challenge in computer vision. To address this, we introduce Nested Neural Feature Fields (N2F2), a novel approach that employs hierarchical supervision to learn a single feature field, wherein different dimensions within the same high-dimensional feature encode scene properties at varying granularities. Our method allows for a flexible definition of hierarchies, tailored to physical scale, semantics, or both, thereby enabling a comprehensive and nuanced understanding of scenes. We leverage a 2D class-agnostic segmentation model to provide semantically meaningful pixel groupings at arbitrary scales in the image space, and query the CLIP vision encoder to obtain language-aligned embeddings for each of these segments. Our proposed hierarchical supervision method then assigns different nested dimensions of the feature field to distill the CLIP embeddings using deferred volumetric rendering at varying physical scales, creating a coarse-to-fine representation. Extensive experiments show that our approach outperforms state-of-the-art feature-field distillation methods on tasks such as open-vocabulary 3D segmentation and localization, demonstrating the effectiveness of the learned nested feature field.
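The core idea of nested hierarchical supervision can be illustrated with a minimal sketch: prefixes of increasing dimensionality within a single feature vector are each distilled toward a CLIP embedding at a corresponding scale. The prefix sizes, the number of scales, and the assumption that per-scale CLIP targets have already been projected to the matching dimensionality are all hypothetical choices for illustration; the actual N2F2 pipeline supervises rendered features via deferred volumetric rendering, which is omitted here.

```python
import numpy as np

def nested_distillation_loss(feature, clip_targets, dims=(8, 64, 512)):
    """Toy sketch of nested hierarchical supervision.

    feature:      (N, D) per-pixel features, with D == dims[-1].
    clip_targets: list of (N, dims[k]) language-aligned target embeddings,
                  one per scale, coarse to fine (assumed pre-projected to
                  the prefix size -- a hypothetical simplification).

    The first dims[0] channels are pushed toward the coarsest-scale CLIP
    embedding, the first dims[1] channels toward the next scale, and so on,
    so coarse semantics live in a nested prefix of the fine representation.
    """
    def cosine_loss(a, b):
        # 1 - mean cosine similarity between rows of a and b.
        a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
        b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
        return float(np.mean(1.0 - np.sum(a * b, axis=-1)))

    # Sum the per-scale losses over the nested prefixes of `feature`.
    return sum(cosine_loss(feature[:, :d], t)
               for d, t in zip(dims, clip_targets))
```

Because each coarser scale supervises a prefix of the same vector, a single field serves all granularities: querying only the first `dims[0]` channels yields a coarse, scene-level response, while the full vector resolves fine-grained parts.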