AgenticNav: Zero-Shot Vision-and-Language Navigation as a Tool-Calling Harness

Zero-shot vision-and-language navigation in continuous environments (VLN-CE) has recently become feasible with large vision-language models (VLMs). However, existing methods typically rely on learned waypoint predictors to propose navigable actions, which restricts the model's action space and fails to leverage depth inputs effectively. Moreover, memory is commonly handled either by accumulating long textual or visual histories that carry substantial irrelevant context, or by retrieving cross-episode experiences, which weakens the zero-shot setting. In this paper, we rethink zero-shot VLN-CE as an agentic interface between the VLM and the environment, and present AgenticNav, a lightweight navigation harness that exposes action, depth, and memory as callable tools. Instead of choosing from predicted waypoints, the VLM uses the action tool to directly select a target pixel in the RGB observation, which is then converted into executable motion. Depth is exposed through an on-demand pixel-depth tool, enabling the VLM to request precise metric distances only where they matter. For memory, AgenticNav provides a compact map image summarizing the historical trajectory, paired with a recall tool that allows the VLM to selectively revisit past visual observations without overwhelming the prompt context. On the R2R-CE benchmark, AgenticNav establishes new state-of-the-art (SOTA) performance among zero-shot methods that use the same VLM backbone. Real-world validation further highlights its zero-shot generalization compared to prior methods. Ablations show that our action tool design outperforms traditional waypoint predictors, and that the depth tool and agentic memory each contribute further to navigation performance.
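
To make the pixel-based action and on-demand depth tools more concrete, the following is a minimal sketch of how a selected pixel could be turned into a turn-then-forward motion. It is not the AgenticNav implementation: the abstract does not specify the conversion, so the pinhole camera model, the level-camera assumption, the z-depth convention, and every name here (`CameraIntrinsics`, `query_pixel_depth`, `pixel_to_motion`) are illustrative assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class CameraIntrinsics:
    """Hypothetical pinhole camera parameters (not taken from the paper)."""
    width: int        # image width in pixels
    height: int       # image height in pixels
    hfov_deg: float   # horizontal field of view in degrees

    @property
    def fx(self) -> float:
        # Focal length in pixels, derived from the horizontal field of view.
        return (self.width / 2.0) / math.tan(math.radians(self.hfov_deg) / 2.0)

    @property
    def cx(self) -> float:
        # Principal point assumed to sit at the image center.
        return self.width / 2.0


def query_pixel_depth(depth_map, u: int, v: int) -> float:
    """Sketch of an on-demand depth tool: metric depth at column u, row v.

    depth_map is assumed to be a row-major grid of z-depths in meters.
    """
    if not (0 <= v < len(depth_map) and 0 <= u < len(depth_map[0])):
        raise ValueError(f"pixel ({u}, {v}) is outside the depth map")
    return float(depth_map[v][u])


def pixel_to_motion(u: float, depth_m: float, cam: CameraIntrinsics):
    """Convert a target pixel column and its z-depth into (turn_deg, forward_m).

    Assumes a level, forward-facing camera, so the image row does not affect
    the ground-plane target. Positive turn_deg means turning to the right.
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    # Back-project the pixel column to a lateral offset in the camera frame.
    lateral_m = (u - cam.cx) * depth_m / cam.fx
    turn_deg = math.degrees(math.atan2(lateral_m, depth_m))
    forward_m = math.hypot(lateral_m, depth_m)
    return turn_deg, forward_m
```

Under these assumptions, a pixel at the image center maps to a pure forward motion equal to its depth, while pixels toward the image edges map to larger turns. A real harness would also need to handle invalid depth readings and obstacles between the agent and the target, which this sketch ignores.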
