Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to learn complex reasoning from long-horizon human interactions. While Multi-modal Large Language Models (MLLMs) have driven recent progress, current training paradigms struggle to balance generalization capability, error recovery, and training stability. Specifically, (i) policies derived from Supervised Fine-Tuning (SFT) suffer from compounding errors and struggle to recover from out-of-distribution states, and (ii) Reinforcement Fine-Tuning (RFT) methods such as GRPO are bottlenecked by sparse outcome rewards: their binary feedback fails to assign credit to individual steps, leading to gradient-signal collapse in failure-dominant batches. To address these challenges, we introduce Step-Aware Contrastive Alignment (SACA), a framework designed to extract dense supervision from imperfect trajectories. At its core, a Perception-Grounded Step-Aware Auditor evaluates progress step by step, disentangling failed trajectories into valid prefixes and exact divergence points. Leveraging these signals, a Scenario-Conditioned Group Construction mechanism dynamically routes batches to specialized resampling and optimization strategies. Extensive experiments on VLN-CE benchmarks demonstrate that SACA achieves state-of-the-art performance.
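The gradient-signal collapse noted above can be seen directly in GRPO's group-normalized advantage. A minimal sketch (hypothetical code, not the paper's implementation): with binary outcome rewards, a group containing both successes and failures yields a contrastive signal, but a failure-dominant group with identical rewards normalizes to zero advantage for every rollout, so no policy gradient flows.

```python
# Hypothetical sketch of GRPO-style group-normalized advantages
# under sparse binary outcome rewards (not the paper's code).
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Advantage of each rollout: (r - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Mixed group: success/failure contrast gives nonzero advantages.
mixed = grpo_advantages([1, 0, 0, 1])

# All-failure group: identical rewards normalize to zero advantage
# for every rollout, so the policy gradient vanishes for the batch.
collapsed = grpo_advantages([0, 0, 0, 0])  # -> [0.0, 0.0, 0.0, 0.0]
```

This is precisely the regime SACA targets: its step-aware auditing recovers dense per-step supervision from such otherwise uninformative failure groups.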