FloorPlan-VLN: A New Paradigm for Floor Plan Guided Vision-Language Navigation

Existing Vision-Language Navigation (VLN) tasks require agents to follow verbose instructions while ignoring potentially useful global spatial priors, which limits their ability to reason about spatial structure. Although human-readable spatial schematics (e.g., floor plans) are ubiquitous in real-world buildings, current agents lack the ability to comprehend and use them. To bridge this gap, we introduce \textbf{FloorPlan-VLN}, a new paradigm that leverages structured semantic floor plans as global spatial priors to enable navigation with only concise instructions. We first construct the FloorPlan-VLN dataset, which comprises over 10k episodes across 72 scenes. It pairs more than 100 semantically annotated floor plans with Matterport3D-based navigation trajectories and concise instructions that omit step-by-step guidance. We then propose \textbf{FP-Nav}, a simple yet effective method that uses a dual-view, spatio-temporally aligned video sequence together with auxiliary reasoning tasks to align observations, floor plans, and instructions. On this new benchmark, our method significantly outperforms adapted state-of-the-art VLN baselines, achieving more than a 60\% relative improvement in navigation success rate. Furthermore, comprehensive noise modeling and real-world deployments demonstrate that FP-Nav is feasible in practice and robust to actuation drift and floor plan distortions. These results validate the effectiveness of floor plan guided navigation and highlight FloorPlan-VLN as a promising step toward more spatially intelligent navigation.
