Simultaneous Localization and Mapping (SLAM) is a foundational component in robotics, AR/VR, and autonomous systems. With the rising focus on spatial AI in recent years, combining SLAM with semantic understanding has become increasingly important for enabling intelligent perception and interaction. Recent efforts have explored this integration, but they often rely on depth sensors or closed-set semantic models, limiting their scalability and adaptability in open-world environments. In this work, we present OpenMonoGS-SLAM, the first monocular SLAM framework that unifies 3D Gaussian Splatting (3DGS) with open-set semantic understanding. To achieve this goal, we leverage recent advances in Visual Foundation Models (VFMs), including MASt3R for visual geometry and SAM and CLIP for open-vocabulary semantics. These models generalize robustly across diverse tasks, enabling accurate monocular camera tracking and mapping as well as rich semantic understanding of open-world environments. Our method operates without any depth input or 3D semantic ground truth, relying solely on self-supervised learning objectives. Furthermore, we propose a memory mechanism specifically designed to manage high-dimensional semantic features, which effectively constructs Gaussian semantic feature maps, leading to strong overall performance. Experimental results demonstrate that our approach matches or surpasses existing baselines in both closed-set and open-set segmentation tasks, all without relying on auxiliary inputs such as depth maps or semantic annotations.
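To illustrate the kind of memory mechanism described above, the sketch below shows one plausible way to manage high-dimensional semantic features in a Gaussian map: a small bank of prototype feature vectors, with each Gaussian storing only low-dimensional mixing weights instead of a full CLIP-sized embedding. All names, dimensions, and the prototype-bank design itself are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions: CLIP-like semantic features are high-dimensional
# (e.g. 512), so storing one per Gaussian is memory-heavy. A common remedy
# (an assumption here, not necessarily OpenMonoGS-SLAM's design) is a small
# memory bank of prototype features; each Gaussian keeps only mixing weights.
FEAT_DIM = 512       # dimensionality of the semantic (CLIP-like) feature
BANK_SIZE = 64       # number of prototype vectors in the memory bank
N_GAUSSIANS = 1000   # toy number of Gaussians in the map

memory_bank = rng.normal(size=(BANK_SIZE, FEAT_DIM))   # prototype features
weights = rng.random(size=(N_GAUSSIANS, BANK_SIZE))    # per-Gaussian coefficients
weights /= weights.sum(axis=1, keepdims=True)          # normalize to convex weights

# Reconstruct a full semantic feature per Gaussian on demand.
features = weights @ memory_bank                       # shape (N_GAUSSIANS, FEAT_DIM)

# Open-vocabulary query: score each Gaussian against a text embedding
# (here random) via cosine similarity, as one would with CLIP text features.
text_embedding = rng.normal(size=FEAT_DIM)
sims = (features @ text_embedding) / (
    np.linalg.norm(features, axis=1) * np.linalg.norm(text_embedding)
)
print(features.shape, sims.shape)
```

The per-Gaussian storage drops from `FEAT_DIM` floats to `BANK_SIZE` weights, which is the practical point of such a memory: full features are materialized only when a semantic query is made.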