SpatialNav: Leveraging Spatial Scene Graphs for Zero-Shot Vision-and-Language Navigation

Learning-based vision-and-language navigation (VLN) agents can learn spatial knowledge implicitly from large-scale training data. Zero-shot VLN agents lack this training process and rely primarily on local observations, which leads to inefficient exploration and a significant performance gap. To address this problem, we consider a zero-shot VLN setting in which agents are allowed to fully explore the environment before task execution. We then construct a Spatial Scene Graph (SSG) that explicitly captures the global spatial structure and semantics of the explored environment. Based on the SSG, we introduce SpatialNav, a zero-shot VLN agent that integrates an agent-centric spatial map, a compass-aligned visual representation, and a remote object localization strategy for efficient navigation. Comprehensive experiments in both discrete and continuous environments demonstrate that SpatialNav significantly outperforms existing zero-shot agents and clearly narrows the gap with state-of-the-art learning-based methods. These results highlight the importance of global spatial representations for generalizable navigation.

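To make the idea of a scene graph queried from the agent's point of view more concrete, the following is a minimal illustrative sketch, not the SpatialNav implementation. It assumes a two-level graph (rooms containing objects with 2D world coordinates) and a heading measured counterclockwise from the world +x axis. All names (`SceneGraph`, `RoomNode`, `ObjectNode`, `agent_centric_view`) are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    name: str
    x: float  # world-frame position (assumed 2D for simplicity)
    y: float


@dataclass
class RoomNode:
    name: str
    objects: list = field(default_factory=list)


@dataclass
class SceneGraph:
    """Hypothetical two-level graph: rooms -> objects."""
    rooms: dict = field(default_factory=dict)

    def add_object(self, room: str, name: str, x: float, y: float) -> None:
        # Create the room node on first use, then attach the object to it.
        node = self.rooms.setdefault(room, RoomNode(room))
        node.objects.append(ObjectNode(name, x, y))


def agent_centric_view(graph: SceneGraph, ax: float, ay: float, heading: float):
    """Express every object relative to the agent pose.

    Returns (room, object, forward, left) tuples, where `forward` is the
    offset along the agent's heading and `left` is the offset to its left.
    `heading` is in radians, counterclockwise from the world +x axis.
    """
    c, s = math.cos(heading), math.sin(heading)
    view = []
    for room in graph.rooms.values():
        for obj in room.objects:
            dx, dy = obj.x - ax, obj.y - ay
            forward = dx * c + dy * s   # projection onto the heading direction
            left = -dx * s + dy * c     # projection onto the left-hand normal
            view.append((room.name, obj.name, forward, left))
    return view


if __name__ == "__main__":
    g = SceneGraph()
    g.add_object("kitchen", "sink", 1.0, 0.0)
    g.add_object("bedroom", "lamp", 0.0, 2.0)
    for row in agent_centric_view(g, 0.0, 0.0, math.pi / 2):
        print(row)
```

In this sketch the global graph is built once, and the agent-centric view is recomputed from the current pose at each step, so that descriptions such as "ahead" or "to the right" stay consistent with what the agent currently faces.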