Vision-Language Navigation requires agents to act coherently over long horizons, understanding not only the local visual context but also how far they have advanced within a multi-step instruction. However, recent Vision-Language-Action models focus on direct action prediction, and earlier progress-estimation methods predict only numeric completion values; both overlook the monotonic co-progression of the observation and instruction sequences. Building on this observation, Progress-Think introduces semantic progress reasoning, predicting instruction-style progress from visual observations to enable more accurate navigation. To achieve this without expensive annotations, we propose a three-stage framework. In the first stage, Self-Aligned Progress Pretraining bootstraps a reasoning module via a novel differentiable alignment between the visual history and instruction prefixes. Next, Progress-Guided Policy Pretraining injects the learned progress states into the navigation context, guiding the policy toward progress-consistent actions. Finally, Progress-Policy Co-Finetuning jointly optimizes both modules with tailored progress-aware reinforcement objectives. Experiments on R2R-CE and RxR-CE show state-of-the-art success rates and efficiency, demonstrating that semantic progress yields a more consistent representation of navigation advancement.
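To make the Stage-1 idea concrete, the following is a minimal sketch, not the paper's implementation, of a differentiable alignment between a visual-history embedding and instruction-prefix embeddings. It assumes the alignment reduces to scoring the history against each prefix and taking a temperature-softmax expectation over prefix completion ratios; all function names, shapes, and the temperature value are illustrative assumptions.

```python
# Sketch only: a soft, differentiable estimate of semantic progress,
# assuming precomputed embeddings for the visual history and for the
# K instruction prefixes (prefix k covers instruction steps 1..k).
import torch
import torch.nn.functional as F

def soft_progress_alignment(history_emb: torch.Tensor,
                            prefix_embs: torch.Tensor,
                            temperature: float = 0.1) -> torch.Tensor:
    """history_emb: (d,) embedding of the visual observation history.
    prefix_embs: (K, d) embeddings of the K instruction prefixes.
    Returns a scalar in (0, 1]: the expected fraction of the instruction
    completed, differentiable w.r.t. both inputs."""
    # Cosine similarity between the history and every instruction prefix.
    sims = F.cosine_similarity(history_emb.unsqueeze(0), prefix_embs, dim=-1)  # (K,)
    # Soft (differentiable) argmax over prefixes via a temperature softmax.
    weights = F.softmax(sims / temperature, dim=-1)                            # (K,)
    # Expected normalized progress: weighted average of completion ratios,
    # which grow monotonically with the prefix index.
    ratios = torch.arange(1, prefix_embs.size(0) + 1, dtype=history_emb.dtype,
                          device=history_emb.device) / prefix_embs.size(0)
    return (weights * ratios).sum()

# Toy usage: 3 instruction prefixes, 16-dim embeddings.
h = torch.randn(16, requires_grad=True)
P = torch.randn(3, 16, requires_grad=True)
progress = soft_progress_alignment(h, P)
progress.backward()  # gradients flow back to both encoders
```

Because the soft argmax is differentiable, such a score can supervise the progress reasoner without per-step annotations: as the agent's visual history grows, the expected completion ratio should increase monotonically, mirroring the co-progression property the abstract describes.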