Scene graphs are becoming a standard representation for robot navigation, providing hierarchical geometric and semantic scene understanding. However, most scene graph mapping methods rely on depth cameras or LiDAR sensors. In this work, we present LEXI-SG, the first dense monocular visual mapping system for open-vocabulary 3D scene graphs using only RGB camera input. Our approach exploits the semantic priors of open-vocabulary foundation models to partition the scene into rooms and defers feed-forward reconstruction until each room is fully observed, which enables scalable dense mapping without sliding-window scale inconsistencies. We propose a room-based factor graph formulation that globally aligns room reconstructions while preserving local map consistency and naturally imposing the semantic scene graph hierarchy. Within each room, we further support open-vocabulary object segmentation and tracking. We validate LEXI-SG on indoor scenes from Habitat-Matterport 3D and on self-collected egocentric office sequences, evaluating it against existing feed-forward SLAM methods as well as established scene graph baselines. We demonstrate improved trajectory estimation and dense reconstruction, as well as competitive performance in open-vocabulary segmentation. LEXI-SG shows that accurate, scalable, open-vocabulary 3D scene graphs can be achieved from monocular RGB alone. Our project page and office sequences are available at https://ori-drs.github.io/lexisg-web/.