Vision-Language Navigation requires agents to act coherently over long horizons by understanding not only the local visual context but also how far they have advanced within a multi-step instruction. However, recent Vision-Language-Action models focus on direct action prediction, and earlier progress-estimation methods predict only numeric progress values; both overlook the monotonic co-progression property of the observation and instruction sequences. Building on this property, we present Progress-Think, which introduces semantic progress reasoning: it predicts instruction-style progress from visual observations to enable more accurate navigation. To achieve this without expensive annotations, we propose a three-stage framework. First, Self-Aligned Progress Pretraining bootstraps a reasoning module via a novel differentiable alignment between visual history and instruction prefixes. Second, Progress-Guided Policy Pretraining injects the learned progress states into the navigation context, guiding the policy toward consistent actions. Finally, Progress-Policy Co-Finetuning jointly optimizes both modules with tailored progress-aware reinforcement objectives. Experiments on R2R-CE and RxR-CE show state-of-the-art success rate and efficiency, demonstrating that semantic progress yields a more consistent representation of navigation advancement.
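The monotonic co-progression property can be illustrated with a small sketch. This is not the Progress-Think alignment itself, whose details the abstract does not give. It assumes a hypothetical score matrix `sim`, where `sim[t][k]` says how well the visual history up to step `t` matches the instruction prefix ending at sub-instruction `k`. The functions `monotonic_progress` and `soft_alignment_score` are illustrative names. The first recovers the best non-decreasing prefix index per step by dynamic programming; the second replaces each hard maximum with a temperature-scaled log-sum-exp, one common way to make such an alignment score smooth.

```python
import math


def monotonic_progress(sim):
    """Return the best non-decreasing prefix index for each time step.

    sim is a T x K list of lists (hypothetical observation/prefix scores).
    """
    T, K = len(sim), len(sim[0])
    # best[t][k]: max total score over steps 0..t with step t assigned to k
    best = [[-math.inf] * K for _ in range(T)]
    back = [[0] * K for _ in range(T)]
    best[0] = list(sim[0])
    for t in range(1, T):
        run_val, run_arg = -math.inf, 0
        for k in range(K):
            # Running max over previous indices j <= k enforces monotonicity.
            if best[t - 1][k] > run_val:
                run_val, run_arg = best[t - 1][k], k
            best[t][k] = run_val + sim[t][k]
            back[t][k] = run_arg
    # Backtrack from the best final prefix index.
    k = max(range(K), key=lambda j: best[T - 1][j])
    path = [k]
    for t in range(T - 1, 0, -1):
        k = back[t][k]
        path.append(k)
    return path[::-1]


def _softmax_value(values, tau):
    """Temperature-scaled log-sum-exp, a smooth upper bound on max(values)."""
    m = max(values)
    return m + tau * math.log(sum(math.exp((v - m) / tau) for v in values))


def soft_alignment_score(sim, tau=1.0):
    """Smooth relaxation of the total score of the best monotonic alignment."""
    prev = list(sim[0])
    for t in range(1, len(sim)):
        prev = [
            _softmax_value(prev[: k + 1], tau) + sim[t][k]
            for k in range(len(prev))
        ]
    return _softmax_value(prev, tau)


# Toy scores: step 2 looks most like prefix 0 when judged in isolation.
sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.7, 0.3, 0.2],
    [0.0, 0.2, 0.9],
]
print(monotonic_progress(sim))  # [0, 1, 1, 2]
print(soft_alignment_score(sim, tau=0.01))
```

In the toy matrix, a per-step argmax would give `[0, 1, 0, 2]`, which moves backward in the instruction at step 2. The monotonic constraint yields `[0, 1, 1, 2]` instead, and as `tau` shrinks the soft score approaches the hard optimum.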