Vision-and-Language Navigation (VLN) requires agents to follow long-horizon instructions while navigating complex 3D environments. However, existing approaches face two major challenges: constructing an effective long-term memory bank and overcoming compounding errors. To address these issues, we propose DecoVLN, a framework designed for robust streaming perception and closed-loop control in long-horizon navigation. First, we formulate long-term memory construction as an optimization problem and introduce an adaptive refinement mechanism that selects frames from a historical candidate pool by iteratively optimizing a unified scoring function. This function jointly balances three criteria: semantic relevance to the instruction, visual diversity with respect to the already selected memory, and temporal coverage of the historical trajectory. Second, to alleviate compounding errors, we introduce a corrective finetuning strategy that operates at the level of state-action pairs. By using the geodesic distance between states to quantify deviation from the expert trajectory, the agent collects high-quality state-action pairs within a trusted region while filtering out polluted data with low relevance. This improves both the efficiency and the stability of error correction. Extensive experiments demonstrate the effectiveness of DecoVLN, and we have deployed it in real-world environments.
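To make the memory-selection idea concrete, the following is a minimal sketch of one way to pick frames under a unified score that combines relevance, diversity, and temporal coverage. It is not the DecoVLN implementation: the greedy loop, the cosine-similarity features, the weights `w_rel`, `w_div`, `w_cov`, and the function name `select_memory` are all assumptions made for illustration, since the abstract does not specify the scoring function or the optimizer.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def select_memory(frame_feats, timestamps, instr_feat, budget,
                  w_rel=1.0, w_div=1.0, w_cov=1.0):
    """Greedily pick up to `budget` frame indices from a candidate pool.

    Hypothetical unified score (an assumption, not the paper's formula):
      relevance: cosine similarity between frame and instruction features
      diversity: 1 - max similarity to any already selected frame
      coverage:  normalized time gap to the nearest selected frame
    """
    n = len(frame_feats)
    if n == 0:
        return []
    span = (max(timestamps) - min(timestamps)) or 1.0
    selected = []
    while len(selected) < min(budget, n):
        best_i, best_score = None, -math.inf
        for i in range(n):
            if i in selected:
                continue
            rel = cosine(frame_feats[i], instr_feat)
            if selected:
                div = 1.0 - max(cosine(frame_feats[i], frame_feats[j])
                                for j in selected)
                cov = min(abs(timestamps[i] - timestamps[j])
                          for j in selected) / span
            else:
                # Nothing selected yet: diversity and coverage are maximal.
                div, cov = 1.0, 1.0
            score = w_rel * rel + w_div * div + w_cov * cov
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
    return sorted(selected)


# Toy pool: four frames with 2-D features, one per time step.
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
times = [0, 1, 2, 3]
memory = select_memory(feats, times, instr_feat=[1.0, 0.0], budget=2)
```

In this toy pool the second frame duplicates the first, so the diversity term discourages selecting it even though it is equally relevant to the instruction. A greedy pass is only one possible reading of "iteratively optimizing"; other schemes, such as swapping frames in and out of the memory, would fit the same description.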
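The geodesic filtering step in the corrective finetuning strategy can be sketched in a similar spirit. The example below assumes the environment is available as a weighted navigation graph and that "trusted region" means "within a fixed geodesic radius of the expert trajectory"; the graph encoding, the `radius` threshold, and the names `geodesic_deviation` and `filter_pairs` are hypothetical and are not taken from the paper.

```python
import heapq


def geodesic_deviation(graph, expert_nodes):
    """Shortest-path distance from every reachable node to the nearest
    expert-trajectory node, via multi-source Dijkstra.

    `graph` maps node -> list of (neighbor, edge_length) pairs.
    Node identifiers must be mutually comparable (e.g. all strings).
    """
    dist = {node: 0.0 for node in expert_nodes}
    heap = [(0.0, node) for node in expert_nodes]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def filter_pairs(pairs, graph, expert_nodes, radius):
    """Keep (state, action) pairs whose state lies within `radius`
    (geodesic distance) of the expert trajectory; drop the rest."""
    dist = geodesic_deviation(graph, expert_nodes)
    return [(s, a) for s, a in pairs
            if dist.get(s, float("inf")) <= radius]


# Toy graph: expert path A-B-C, with a side branch A-X-Y.
graph = {
    "A": [("B", 1.0), ("X", 1.0)],
    "B": [("A", 1.0), ("C", 1.0)],
    "C": [("B", 1.0)],
    "X": [("A", 1.0), ("Y", 2.0)],
    "Y": [("X", 2.0)],
}
expert = ["A", "B", "C"]
collected = [("A", "forward"), ("X", "turn_left"), ("Y", "forward")]
kept = filter_pairs(collected, graph, expert, radius=1.5)
```

Here the pair collected at `Y` is discarded because it lies 3.0 units from the expert path along the graph, while the pair at `X` (1.0 unit away) is kept as a recoverable deviation. Using path length through the graph rather than straight-line distance matters in indoor scenes, where two states can be close in Euclidean terms but separated by a wall.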