A navigation agent must understand both high-level semantic instructions and precise spatial perception. Building navigation agents around Multimodal Large Language Models (MLLMs) is a promising approach owing to their strong generalization ability; however, the tightly coupled designs used today severely limit system performance. In this work, we propose a decoupled design that separates low-level spatial state estimation from high-level semantic planning. Unlike previous methods that rely on predefined, oversimplified textual maps, we introduce an interactive metric world representation that maintains rich, consistent information, allowing MLLMs to interact with it and reason over it for decision-making. We further introduce counterfactual reasoning to elicit MLLMs' capacity, while the metric world representation ensures the physical validity of the produced actions. Comprehensive experiments in both simulated and real-world environments show that our method establishes a new zero-shot state of the art, achieving 48.8\% Success Rate (SR) on the R2R-CE benchmark and 42.2\% on RxR-CE. To validate the versatility of our metric representation, we further demonstrate zero-shot sim-to-real transfer across diverse embodiments, including a wheeled TurtleBot 4 and a custom-built aerial drone. These real-world deployments verify that our decoupled framework serves as a robust, domain-invariant interface for embodied Vision-and-Language Navigation.