FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation

Achieving human-level performance in Vision-and-Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual-spatial context while reasoning over long action sequences. Recent works, such as NavCoT and NavGPT-2, demonstrate the potential of Chain-of-Thought (CoT) reasoning for improving interpretability and long-horizon planning. Moreover, multimodal extensions like OctoNav-R1 and CoT-VLA further validate CoT as a promising pathway toward human-like navigation reasoning. However, existing approaches face critical drawbacks: purely textual CoTs lack spatial grounding and easily overfit to sparse annotated reasoning steps, while multimodal CoTs incur severe token inflation by generating imagined visual observations, making real-time navigation impractical. In this work, we propose FantasyVLN, a unified implicit reasoning framework that preserves the benefits of CoT reasoning without explicit token overhead. Specifically, imagined visual tokens are encoded into a compact latent space using a pretrained Visual AutoRegressor (VAR) during CoT reasoning training, and the model jointly learns from textual, visual, and multimodal CoT modes under a unified multi-CoT strategy. At inference, our model performs direct instruction-to-action mapping while still enjoying reasoning-aware representations. Extensive experiments on LH-VLN show that our approach achieves reasoning-aware yet real-time navigation, improving success rates and efficiency while reducing inference latency by an order of magnitude compared to explicit CoT methods.
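To make the token-overhead argument concrete, the following is a minimal sketch, not the authors' implementation, of how one training sample might be expanded into targets under different CoT modes while the inference-time target stays a single action. The mode names, the `Step` fields, and the helpers `build_target` and `sample_training_target` are all hypothetical. Including the `direct` mode in training is likewise an assumption, and plain integer codes stand in for the compact latent produced by the pretrained VAR.

```python
import random
from dataclasses import dataclass
from typing import List

# Hypothetical mode names. The abstract names textual, visual, and multimodal
# CoT modes, plus direct instruction-to-action mapping at inference.
MODES = ("direct", "textual", "visual", "multimodal")


@dataclass
class Step:
    instruction: str
    text_cot: List[str]       # annotated textual reasoning tokens
    visual_latent: List[int]  # compact latent codes standing in for an imagined view
    action: str               # ground-truth navigation action


def build_target(step: Step, mode: str) -> list:
    """Return the sequence the model would be trained to emit in `mode`."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    target: list = []
    if mode in ("textual", "multimodal"):
        target.extend(step.text_cot)
    if mode in ("visual", "multimodal"):
        target.extend(step.visual_latent)
    target.append(step.action)  # every mode ends with the action
    return target


def sample_training_target(step: Step, rng: random.Random):
    """Pick one mode per sample so a single model sees all modes (assumed scheme)."""
    mode = rng.choice(MODES)
    return mode, build_target(step, mode)


if __name__ == "__main__":
    step = Step(
        instruction="go to the kitchen",
        text_cot=["see", "door"],
        visual_latent=[7, 3, 9, 1],
        action="turn_left",
    )
    for mode in MODES:
        print(mode, len(build_target(step, mode)))
```

In this toy setup the `direct` target has length 1 regardless of how long the reasoning trace is, which is the property the abstract relies on for real-time inference. The CoT modes only add tokens during training.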
