3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis and has subsequently been extended to numerous spatial AI applications. However, most existing 3DGS methods operate in isolation, each focusing on a specific domain such as online SLAM, semantic enrichment, or 3DGS from unposed images. In this paper, we introduce X-GS, an extensible open framework that unifies a broad range of techniques to enable real-time, semantically enriched 3DGS-based online SLAM, bridging the gap to downstream multimodal models. At the core of X-GS is a highly efficient pipeline, X-GS-Perceiver, which takes unposed RGB (or optionally RGB-D) video streams as input, jointly optimizes geometry and camera poses, and distills high-dimensional semantic features from vision foundation models into the 3D Gaussians. We achieve real-time performance through a novel online Vector Quantization (VQ) module, a GPU-accelerated grid-sampling scheme, and a highly parallelized pipeline design. The resulting semantic 3D Gaussians can then be consumed by vision-language models within the X-GS-Thinker component, enabling downstream tasks such as object detection, zero-shot caption generation, and, potentially, embodied tasks. Experimental results on real-world datasets demonstrate the efficacy, efficiency, and newly unlocked multimodal capabilities of the X-GS framework.
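To make the role of the online VQ module concrete, the following is a minimal sketch (not the paper's implementation) of streaming vector quantization: high-dimensional per-pixel features from a vision foundation model are compressed into a small codebook, so each 3D Gaussian stores only a code index rather than a full feature vector. All names (OnlineVQ, codebook_size, feature_dim) are illustrative assumptions.

```python
import numpy as np

class OnlineVQ:
    """Illustrative online vector quantizer (streaming k-means style)."""

    def __init__(self, codebook_size: int = 256, feature_dim: int = 512, lr: float = 0.05):
        # Codebook initialized randomly; refined incrementally as feature batches stream in.
        self.codebook = np.random.randn(codebook_size, feature_dim).astype(np.float32)
        self.lr = lr

    def quantize(self, feats: np.ndarray) -> np.ndarray:
        """Assign each feature to its nearest codebook entry (squared L2 distance)."""
        # (N, K) pairwise squared distances via ||x||^2 - 2 x.c + ||c||^2
        d = (
            (feats ** 2).sum(1, keepdims=True)
            - 2.0 * feats @ self.codebook.T
            + (self.codebook ** 2).sum(1)
        )
        return d.argmin(axis=1)

    def update(self, feats: np.ndarray, codes: np.ndarray) -> None:
        """Move each used codebook entry toward the mean of its assigned features."""
        for k in np.unique(codes):
            mean_k = feats[codes == k].mean(axis=0)
            self.codebook[k] += self.lr * (mean_k - self.codebook[k])

# Usage: compress a batch of foundation-model features for newly added Gaussians.
vq = OnlineVQ(codebook_size=64, feature_dim=128)
batch = np.random.randn(1000, 128).astype(np.float32)
codes = vq.quantize(batch)   # per-Gaussian code indices, shape (1000,)
vq.update(batch, codes)      # refine the codebook as the stream continues
```

In this hypothetical setup, each Gaussian carries only a small integer index, which keeps memory and rasterization cost low while the shared codebook preserves the semantic feature space for downstream vision-language models.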