3D Gaussian Splatting has recently gained traction for its efficient training and real-time rendering. While the vanilla Gaussian Splatting representation is mainly designed for view synthesis, recent works have investigated how to extend it with scene understanding and language features. However, existing methods lack a detailed comprehension of scenes, limiting their ability to segment and interpret complex structures. To this end, we introduce SuperGSeg, a novel approach that fosters cohesive, context-aware scene representation by disentangling segmentation from language field distillation. SuperGSeg first employs neural Gaussians to learn instance and hierarchical segmentation features from multi-view images with the aid of off-the-shelf 2D masks. These features are then leveraged to create a sparse set of what we call Super-Gaussians. Super-Gaussians facilitate the distillation of 2D language features into 3D space. Through Super-Gaussians, our method enables high-dimensional language feature rendering without an extreme increase in GPU memory. Extensive experiments demonstrate that SuperGSeg outperforms prior works on both open-vocabulary object localization and semantic segmentation tasks.