Leveraging multimodal large language models (MLLMs) to develop embodied agents offers significant promise for addressing complex real-world tasks. However, current evaluation benchmarks remain predominantly language-centric or heavily reliant on simulated environments, and rarely probe the nuanced, knowledge-intensive reasoning that practical, real-world scenarios require. To bridge this gap, we introduce the task of Sparsely Grounded Visual Navigation, explicitly designed to evaluate the sequential decision-making abilities of MLLMs in challenging, knowledge-intensive real-world environments. We operationalize this task with CityNav, a benchmark spanning four diverse global cities, constructed to assess raw MLLM-driven agents in city navigation. Agents must rely solely on visual inputs and internal multimodal reasoning to navigate sequentially through 50+ decision points, without additional environmental annotations or specialized architectural modifications. Crucially, agents must localize themselves autonomously by interpreting city-specific cues and recognizing landmarks, perform spatial reasoning, and strategically plan and execute routes to their destinations. Through extensive evaluations, we demonstrate that current state-of-the-art MLLMs, reasoning techniques (e.g., GEPA, chain-of-thought, reflection), and the competitive baseline PReP significantly underperform in this challenging setting. To address this, we propose Verbalization of Path (VoP), which explicitly grounds the agent's internal reasoning by probing city-scale cognitive maps (key landmarks and directions toward the destination) from the MLLM, substantially enhancing navigation success. Project Webpage: https://dwipddalal.github.io/AgentNav/