Why must vision-language navigation be bound to detailed and verbose language instructions? While such details ease decision-making, they fundamentally contradict the goal of real-world navigation. Ideally, agents should possess the autonomy to navigate unknown environments guided solely by simple, high-level intents. Realizing this ambition introduces a formidable challenge: Beyond-the-View Navigation (BVN), where agents must locate distant, unseen targets without dense, step-by-step guidance. Existing large language model (LLM)-based methods, though adept at following dense instructions, often exhibit short-sighted behaviors due to their reliance on short-horizon supervision. Simply extending the supervision horizon, however, destabilizes LLM training. In this work, we identify that video generation models inherently benefit from long-horizon supervision when aligning with language instructions, rendering them uniquely suitable for BVN tasks. Capitalizing on this insight, we are the first to introduce video generation models into this field. Yet the prohibitive latency of generating videos spanning tens of seconds makes real-world deployment impractical. To bridge this gap, we propose SparseVideoNav, which achieves sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon, yielding a remarkable 27x speed-up over the unoptimized counterpart. Extensive real-world zero-shot experiments demonstrate that SparseVideoNav achieves 2.5x the success rate of state-of-the-art LLM baselines on BVN tasks and marks the first realization of such capability in challenging night scenes.