3D Gaussian Splatting has recently gained traction for its efficient training and real-time rendering. While the vanilla representation is designed primarily for view synthesis, recent works have extended it to scene understanding with language features. However, storing an additional high-dimensional feature per Gaussian to encode semantic information is memory-intensive, which limits the ability of such methods to segment and interpret challenging scenes. To this end, we introduce SuperGSeg, a novel approach that fosters a cohesive, context-aware hierarchical scene representation by disentangling segmentation from language-field distillation. SuperGSeg first employs neural 3D Gaussians to learn geometry, instance, and hierarchical segmentation features from multi-view images with the aid of off-the-shelf 2D masks. These features are then leveraged to create a sparse set of \acrlong{superg}s. \acrlong{superg}s facilitate the lifting and distillation of 2D language features into 3D space, enabling hierarchical scene understanding with high-dimensional language-feature rendering at moderate GPU memory cost. Extensive experiments demonstrate that SuperGSeg achieves remarkable performance on both open-vocabulary object selection and semantic segmentation tasks.