Versatile and adaptive semantic understanding would enable autonomous systems to comprehend and interact with their surroundings. Existing fixed-class models limit the adaptability of indoor mobile and assistive autonomous systems. In this work, we introduce LEXIS, a real-time indoor Simultaneous Localization and Mapping (SLAM) system that harnesses the open-vocabulary nature of Large Language Models (LLMs) to create a unified approach to scene understanding and place recognition. The approach first builds a topological SLAM graph of the environment (using visual-inertial odometry) and embeds Contrastive Language-Image Pretraining (CLIP) features in the graph nodes. We use this representation for flexible room classification and segmentation, which serves as a basis for room-centric place recognition. This allows loop closure searches to be directed towards semantically relevant places. We evaluate the proposed system on both public, simulated data and real-world data, covering office and home environments. It successfully categorizes rooms with varying layouts and dimensions and outperforms the state-of-the-art (SOTA). For place recognition and trajectory estimation, it achieves performance equivalent to the SOTA, with all tasks using the same pre-trained model. Lastly, we demonstrate the system's potential for planning.
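The room-centric idea described above can be illustrated with a minimal sketch. It is not the LEXIS implementation: the function names, the three-dimensional toy vectors standing in for CLIP image and text embeddings, and the rule of grouping consecutive same-label nodes into a room segment are all assumptions made for illustration. Each graph node is assigned the room label whose text embedding is most similar to the node's feature, and the loop-closure search for a node is then restricted to earlier nodes that share its label.

```python
import math
from itertools import groupby


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_nodes(node_feats, label_feats):
    """Give each node the label whose text embedding is most similar to it."""
    return [
        max(label_feats, key=lambda name: cosine(feat, label_feats[name]))
        for feat in node_feats
    ]


def segment_rooms(labels):
    """Group consecutive nodes with the same label into (label, start, end)."""
    segments, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, start, start + length - 1))
        start += length
    return segments


def loop_closure_candidates(query_idx, labels):
    """Earlier nodes that share the query node's room label."""
    return [i for i in range(query_idx) if labels[i] == labels[query_idx]]


# Hypothetical toy data: real CLIP embeddings have hundreds of dimensions.
label_feats = {
    "office": [1.0, 0.0, 0.0],
    "kitchen": [0.0, 1.0, 0.0],
    "corridor": [0.0, 0.0, 1.0],
}
node_feats = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.2, 0.9],
    [0.1, 0.9, 0.2],
    [0.7, 0.1, 0.3],
]

labels = classify_nodes(node_feats, label_feats)
segments = segment_rooms(labels)
candidates = loop_closure_candidates(4, labels)
```

Because the labels are plain text prompts, the set of room classes in this sketch can be changed by editing `label_feats` alone, which is the open-vocabulary property the abstract relies on. In the toy data, the last node revisits the office, so only the two earlier office nodes are considered as loop-closure candidates.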