Understanding complex scenes at multiple levels of abstraction remains a formidable challenge in computer vision. To address this, we introduce Nested Neural Feature Fields (N2F2), a novel approach that employs hierarchical supervision to learn a single feature field, wherein different dimensions within the same high-dimensional feature encode scene properties at varying granularities. Our method allows for a flexible definition of hierarchies, tailored to physical dimensions, semantics, or both, thereby enabling a comprehensive and nuanced understanding of scenes. We leverage a 2D class-agnostic segmentation model to provide semantically meaningful pixel groupings at arbitrary scales in image space, and query the CLIP vision encoder to obtain language-aligned embeddings for each of these segments. Our proposed hierarchical supervision method then assigns different nested dimensions of the feature field to distill the CLIP embeddings using deferred volumetric rendering at varying physical scales, creating a coarse-to-fine representation. Extensive experiments show that our approach outperforms state-of-the-art feature field distillation methods on tasks such as open-vocabulary 3D segmentation and localization, demonstrating the effectiveness of the learned nested feature field.
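As a rough illustration of the nested supervision idea, the NumPy sketch below composites per-sample features along a single ray before any loss is applied (the "deferred" part), then supervises progressively longer prefixes of the rendered feature against per-scale target embeddings. It is a toy under stated assumptions, not the method's implementation: the function names `render_feature` and `nested_loss`, the prefix sizes, the cosine loss, the random stand-ins for CLIP embeddings, and in particular the truncation of each target to the prefix length are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def render_feature(weights, sample_features):
    """Composite per-sample features along one ray: sum_i w_i * f_i.

    weights:         (S,) volumetric rendering weights for S samples.
    sample_features: (S, D) feature vectors queried from the field.
    """
    return weights @ sample_features  # -> (D,)


def cosine_loss(a, b, eps=1e-8):
    """1 - cosine similarity between two vectors."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)


def nested_loss(rendered, targets, prefix_dims):
    """Sum of per-scale losses over nested prefixes of one rendered feature.

    Scale k only sees rendered[:prefix_dims[k]], so shorter prefixes are
    shared by every longer one. Truncating the target to the same length
    is an illustrative assumption, not something stated in the abstract.
    """
    total = 0.0
    for d, target in zip(prefix_dims, targets):
        total += cosine_loss(rendered[:d], target[:d])
    return total


# Toy usage with random data standing in for field samples and embeddings.
rng = np.random.default_rng(0)
num_samples, feat_dim = 16, 8
prefix_dims = [2, 4, 8]  # hypothetical nested sizes, one per scale

raw = rng.random(num_samples)
weights = raw / raw.sum()  # normalized stand-in for rendering weights
sample_features = rng.normal(size=(num_samples, feat_dim))
targets = [rng.normal(size=feat_dim) for _ in prefix_dims]

rendered = render_feature(weights, sample_features)
print("nested loss:", nested_loss(rendered, targets, prefix_dims))
```

Because every scale reads a prefix of the same vector, a single rendered feature serves all granularities, and only the number of dimensions consulted changes from one scale to the next.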