SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning

Vision-and-Language Navigation in continuous environments requires agents to understand the spatial structure of previously unseen environments in order to follow language instructions. Although foundation models have opened a promising path toward zero-shot navigation without task-specific policy training, many navigators still rely on local visual cues and linear, history-based reasoning, overlooking the inherently spatial nature of navigation: explored regions, traversed paths, landmarks, and the relations among them. In this paper, we propose SpaceVLN, a navigation agent built around Spatial Cognitive Memory and Task-Guided Spatial Reasoning. Specifically, SpaceVLN introduces an efficient stagewise closed-loop framework in which planning and execution are organized around verifiable space-landmark stages. During navigation, the agent progressively abstracts explored regions into Spatial Waypoints and dynamically maintains subtask-grounded landmark evidence, forming a hierarchical Spatial Cognitive Memory for progress localization and spatial-relation understanding. Built on this memory, Spatial-CoT integrates task-progress reasoning with spatial perception, analysis, and prediction, enabling Task-Guided Spatial Reasoning for embodied navigation. This stage interface lets SpaceVLN address both Vision-and-Language Navigation and Object-Goal Navigation under a single zero-shot setting, without task-specific policy training. Across R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON, SpaceVLN achieves state-of-the-art zero-shot performance, and real-robot deployment further validates its applicability. These results highlight Spatial Cognitive Memory and Task-Guided Spatial Reasoning as a practical foundation for stronger embodied navigation agents.
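
To make the idea of a hierarchical memory concrete, the following is a minimal illustrative sketch, not the SpaceVLN implementation. It assumes 2D agent positions, a fixed merge radius for abstracting nearby positions into a waypoint, and integer stage indices. The names `SpatialWaypoint`, `SpatialMemory`, `observe`, `stage_verified`, and `merge_radius` are hypothetical and do not come from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from math import hypot


@dataclass
class SpatialWaypoint:
    """Hypothetical coarse node summarizing nearby visited positions."""
    x: float
    y: float
    visits: int = 1
    # landmark name -> set of stage indices in which it was observed here
    landmarks: dict[str, set[int]] = field(default_factory=dict)


class SpatialMemory:
    """Toy two-level memory: waypoints, plus landmark evidence per waypoint."""

    def __init__(self, merge_radius: float = 1.5) -> None:
        self.merge_radius = merge_radius  # assumed clustering threshold (metres)
        self.waypoints: list[SpatialWaypoint] = []
        self.path: list[int] = []  # waypoint indices in traversal order

    def _nearest(self, x: float, y: float) -> int | None:
        """Return the index of the closest waypoint within merge_radius."""
        best, best_d = None, self.merge_radius
        for i, wp in enumerate(self.waypoints):
            d = hypot(wp.x - x, wp.y - y)
            if d <= best_d:
                best, best_d = i, d
        return best

    def observe(self, x: float, y: float, stage: int, landmarks=()) -> int:
        """Fold one observation into the memory and return its waypoint index."""
        idx = self._nearest(x, y)
        if idx is None:
            self.waypoints.append(SpatialWaypoint(x, y))
            idx = len(self.waypoints) - 1
        else:
            wp = self.waypoints[idx]
            wp.visits += 1
            # Running mean keeps the waypoint at the centroid of its members.
            wp.x += (x - wp.x) / wp.visits
            wp.y += (y - wp.y) / wp.visits
        for name in landmarks:
            self.waypoints[idx].landmarks.setdefault(name, set()).add(stage)
        if not self.path or self.path[-1] != idx:
            self.path.append(idx)
        return idx

    def stage_verified(self, stage: int, landmark: str) -> bool:
        """True if the landmark was observed anywhere during the given stage."""
        return any(stage in wp.landmarks.get(landmark, ()) for wp in self.waypoints)


memory = SpatialMemory(merge_radius=1.5)
memory.observe(0.0, 0.0, stage=0)
memory.observe(0.5, 0.0, stage=0, landmarks=["sofa"])
memory.observe(4.0, 0.0, stage=1, landmarks=["kitchen door"])
print(len(memory.waypoints), memory.path, memory.stage_verified(0, "sofa"))
```

In this sketch the first two observations merge into one waypoint and the third creates a second one, so the traversed path is a short sequence of waypoint indices rather than a raw pose history. A stage check then reduces to a lookup of landmark evidence, which is one simple way to read "verifiable" stages; the actual verification in SpaceVLN may differ.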
