City Navigation in the Wild: Exploring Emergent Navigation from Web-Scale Knowledge in MLLMs

Leveraging multimodal large language models (MLLMs) to develop embodied agents offers significant promise for addressing complex real-world tasks. However, current evaluation benchmarks remain predominantly language-centric or heavily reliant on simulated environments, and rarely probe the nuanced, knowledge-intensive reasoning essential for practical, real-world scenarios. To bridge this gap, we introduce the task of Sparsely Grounded Visual Navigation, explicitly designed to evaluate the sequential decision-making abilities of MLLMs in challenging, knowledge-intensive real-world environments. We operationalize this task with CityNav, a benchmark encompassing four diverse global cities, constructed to assess raw MLLM-driven agents in city navigation. Agents must rely solely on visual inputs and internal multimodal reasoning to sequentially navigate 50+ decision points without additional environmental annotations or specialized architectural modifications. Crucially, agents must autonomously localize themselves by interpreting city-specific cues and recognizing landmarks, perform spatial reasoning, and strategically plan and execute routes to their destinations. Through extensive evaluations, we demonstrate that current state-of-the-art MLLMs, reasoning techniques (e.g., GEPA, chain-of-thought, reflection), and the competitive baseline PReP significantly underperform in this challenging setting. To address this, we propose Verbalization of Path (VoP), which explicitly grounds the agent's internal reasoning by probing city-scale cognitive maps (key landmarks and directions toward the destination) from the MLLM, substantially enhancing navigation success. Project Webpage: https://dwipddalal.github.io/AgentNav/
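To make the task setup concrete, the following is a minimal sketch of a sequential decision loop in which the agent verbalizes a path (landmarks and direction toward the destination) before choosing each action. It is an illustration under stated assumptions, not the authors' implementation: the toy graph, the `Step` record, the `navigate` function, and both stub callables are hypothetical, and the stubs stand in for MLLM calls that would actually consume street-level images.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical toy street graph: decision point -> {action label: next point}.
Graph = Dict[str, Dict[str, str]]

TOY_GRAPH: Graph = {
    "A": {"north": "B", "east": "C"},
    "B": {"east": "D"},
    "C": {"north": "D"},
    "D": {},
}

# Hand-written hints standing in for knowledge an MLLM would have to supply.
TOY_HINTS: Dict[str, str] = {"A": "east", "B": "east", "C": "north"}


@dataclass
class Step:
    node: str             # decision point where the choice was made
    verbalized_path: str  # text the agent produced before acting
    action: str           # action label that was executed


def stub_verbalize(node: str, goal: str) -> str:
    """Placeholder for an MLLM call that verbalizes the path to the goal."""
    return f"From {node}, head {TOY_HINTS[node]} toward {goal}."


def stub_choose(node: str, options: List[str], plan: str) -> str:
    """Placeholder policy: pick the first option mentioned in the plan."""
    for option in options:
        if option in plan:
            return option
    return options[0]


def navigate(
    graph: Graph,
    start: str,
    goal: str,
    verbalize: Callable[[str, str], str],
    choose: Callable[[str, List[str], str], str],
    max_steps: int = 50,
) -> Tuple[bool, List[Step]]:
    """Run the decision loop; return (reached_goal, trajectory)."""
    node = start
    trajectory: List[Step] = []
    for _ in range(max_steps):
        if node == goal:
            return True, trajectory
        options = sorted(graph[node])
        if not options:
            return False, trajectory  # dead end
        plan = verbalize(node, goal)          # verbalize before acting
        action = choose(node, options, plan)  # then commit to one action
        if action not in graph[node]:
            return False, trajectory  # an invalid action counts as failure
        trajectory.append(Step(node, plan, action))
        node = graph[node][action]
    return node == goal, trajectory


if __name__ == "__main__":
    reached, steps = navigate(TOY_GRAPH, "A", "D", stub_verbalize, stub_choose)
    print(reached, [s.action for s in steps])
```

In a real evaluation, the two stubs would be replaced by model queries over visual observations, and success would be measured across many routes rather than a single toy graph.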
