Recent advances in large vision-language models (VLMs) and large language models (LLMs) have enabled zero-shot approaches to vision-and-language navigation (VLN), where an agent follows natural language instructions using only egocentric perception and reasoning. However, existing zero-shot methods typically construct a naive observation graph and run VLM-LLM inference on it at every step, resulting in high latency and computational cost that limit real-time deployment. To address this, we present SFCo-Nav, an efficient zero-shot VLN framework inspired by the principle of slow-fast cognitive collaboration. SFCo-Nav integrates three key modules: 1) a slow LLM-based planner that produces a strategic chain of subgoals, each linked to an imagined object graph; 2) a fast reactive navigator for real-time object graph construction and subgoal execution; and 3) a lightweight asynchronous slow-fast bridge that aligns the structured, attributed imagined and perceived graphs to estimate navigation confidence, triggering the slow LLM planner only when necessary. To the best of our knowledge, SFCo-Nav is the first slow-fast collaborative zero-shot VLN system that supports asynchronous LLM triggering based on internal confidence. Evaluated on the public R2R and REVERIE benchmarks, SFCo-Nav matches or exceeds the success rates of prior state-of-the-art zero-shot VLN methods while cutting total token consumption per trajectory by over 50% and running more than 3.5 times faster. Finally, we demonstrate SFCo-Nav on a legged robot in a hotel suite, showcasing its efficiency and practicality in indoor environments.
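The confidence-gated triggering performed by the slow-fast bridge can be illustrated with a minimal sketch. It is not the SFCo-Nav implementation: the names `ObjectNode`, `ObjectGraph`, `alignment_confidence`, and `decide`, the scoring rule, the relation weight, and the thresholds are all assumptions chosen for illustration. The sketch scores how well the perceived object graph covers the imagined one and calls for the slow planner only when that score stays low.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ObjectNode:
    """Hypothetical attributed object, e.g. label="bed", attributes={"white"}."""
    label: str
    attributes: frozenset = frozenset()


@dataclass
class ObjectGraph:
    """Hypothetical object graph: nodes plus (subject, relation, object) triples."""
    nodes: list = field(default_factory=list)
    relations: set = field(default_factory=set)


def node_score(imagined, perceived):
    """Score one imagined node against one perceived node (assumed rule)."""
    if imagined.label != perceived.label:
        return 0.0
    if not imagined.attributes:
        return 1.0
    shared = len(imagined.attributes & perceived.attributes)
    # Half credit for the label match, the rest for matching attributes.
    return 0.5 + 0.5 * shared / len(imagined.attributes)


def alignment_confidence(imagined, perceived, relation_weight=0.3):
    """Fraction of the imagined graph that is supported by perception."""
    if not imagined.nodes:
        return 0.0
    node_part = sum(
        max((node_score(n, p) for p in perceived.nodes), default=0.0)
        for n in imagined.nodes
    ) / len(imagined.nodes)
    if not imagined.relations:
        return node_part
    matched = len(imagined.relations & perceived.relations)
    rel_part = matched / len(imagined.relations)
    return (1 - relation_weight) * node_part + relation_weight * rel_part


def decide(confidence, steps_on_subgoal, high=0.8, low=0.3, patience=5):
    """Assumed gating rule: the slow planner is invoked only on 'replan'."""
    if confidence >= high:
        return "advance"   # subgoal reached, move to the next one in the chain
    if confidence < low and steps_on_subgoal >= patience:
        return "replan"    # perception disagrees with the plan, call the LLM
    return "continue"      # keep executing with the fast navigator


imagined = ObjectGraph(
    nodes=[ObjectNode("bed", frozenset({"white"})), ObjectNode("lamp")],
    relations={("lamp", "next_to", "bed")},
)
perceived = ObjectGraph(
    nodes=[ObjectNode("bed", frozenset({"white", "large"})), ObjectNode("lamp")],
    relations={("lamp", "next_to", "bed")},
)
confidence = alignment_confidence(imagined, perceived)
action = decide(confidence, steps_on_subgoal=2)
```

Under this assumed rule the LLM is queried only when `decide` returns `"replan"`, which is how a confidence gate could avoid per-step VLM-LLM inference.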