Embodied navigation is a fundamental capability for robotic agents. Real-world deployment requires open-vocabulary generalization and low training overhead, motivating zero-shot methods over task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relational edges with dynamically assigned images. Built on M3DSG, we propose MSGNav, a zero-shot navigation system comprising a Key Subgraph Selection module for efficient reasoning, an Adaptive Vocabulary Update module for open-vocabulary support, and a Closed-Loop Reasoning module for accurate exploration reasoning. We further identify the last-mile problem in zero-shot navigation: determining a feasible target location with a suitable final viewpoint. To resolve it explicitly, we propose a Visibility-based Viewpoint Decision module. Comprehensive experimental results demonstrate that MSGNav achieves state-of-the-art performance on the challenging GOAT-Bench and HM3D-ObjNav benchmarks. The code will be publicly available at https://github.com/ylwhxht/MSGNav.