Vision-and-Language Navigation (VLN) requires agents to follow long-horizon instructions while navigating complex 3D environments. However, existing approaches face two major challenges: constructing an effective long-term memory bank and overcoming compounding errors. To address these issues, we propose DecoVLN, a framework designed for robust streaming perception and closed-loop control in long-horizon navigation. First, we formulate long-term memory construction as an optimization problem and introduce an adaptive refinement mechanism that selects frames from a historical candidate pool by iteratively optimizing a unified scoring function. This function jointly balances three criteria: semantic relevance to the instruction, visual diversity with respect to the already-selected memory, and temporal coverage of the historical trajectory. Second, to alleviate compounding errors, we introduce a corrective finetuning strategy that operates at the level of state-action pairs. By using the geodesic distance between states to quantify deviation from the expert trajectory, the agent collects high-quality state-action pairs within a trusted region while filtering out polluted data with low relevance. This improves both the efficiency and stability of error correction. Extensive experiments demonstrate the effectiveness of DecoVLN, and we have deployed it in real-world environments.
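To make the memory-selection idea concrete, the following is a minimal sketch of choosing frames with a unified score, not the DecoVLN implementation. The abstract does not give the scoring function, so everything here is an assumption made for illustration: the function name `select_memory`, the weights `w_rel`, `w_div`, and `w_cov`, the use of cosine similarity for relevance and diversity, the normalized time gap as a coverage term, and the greedy one-frame-at-a-time loop standing in for the iterative optimization.

```python
import numpy as np


def select_memory(frame_feats, instr_feat, k, w_rel=1.0, w_div=0.5, w_cov=0.5):
    """Greedily pick up to k frame indices from a candidate pool.

    Hypothetical score for candidate i given the selected set S:
        w_rel * cos(frame_i, instruction)
      + w_div * (1 - max_j cos(frame_i, frame_j))   for j in S
      + w_cov * min_j |t_i - t_j|                   for j in S
    """
    feats = np.asarray(frame_feats, dtype=float)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    instr = np.asarray(instr_feat, dtype=float)
    instr = instr / np.linalg.norm(instr)

    n = len(feats)
    times = np.arange(n) / max(n - 1, 1)  # normalized timestamps in [0, 1]
    relevance = feats @ instr             # cosine similarity to the instruction

    selected = []
    for _ in range(min(k, n)):
        best, best_score = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            if selected:
                # Distance from the most similar frame already in memory.
                diversity = 1.0 - max(float(feats[i] @ feats[j]) for j in selected)
                # Time gap to the nearest frame already in memory.
                coverage = min(abs(times[i] - times[j]) for j in selected)
            else:
                diversity, coverage = 1.0, 1.0
            score = w_rel * relevance[i] + w_div * diversity + w_cov * coverage
            if score > best_score:  # strict '>' keeps the earliest frame on ties
                best, best_score = i, score
        selected.append(best)
    return sorted(selected)
```

With relevance alone, near-duplicate frames from the same moment tend to fill the memory; the diversity and coverage terms in this sketch push the selection toward frames that look different and are spread across the trajectory.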
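The geodesic filtering step for corrective finetuning can be sketched in a similar spirit. This is an illustrative example under stated assumptions, not the authors' method: the navigable space is modeled as an unweighted graph (a plain adjacency dict), geodesic distance is approximated by hop count via breadth-first search, and the names `geodesic_to_path`, `filter_pairs`, and the threshold `max_dist` defining the trusted region are all hypothetical.

```python
from collections import deque


def geodesic_to_path(graph, start, expert_path):
    """Hop-count shortest-path distance from `start` to the nearest expert state.

    Returns float('inf') if no expert state is reachable from `start`.
    """
    targets = set(expert_path)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node in targets:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")


def filter_pairs(graph, pairs, expert_path, max_dist):
    """Keep (state, action) pairs whose state lies within the trusted region."""
    return [
        (state, action)
        for state, action in pairs
        if geodesic_to_path(graph, state, expert_path) <= max_dist
    ]


# Toy navigation graph: expert route A-B-C-D with a side branch B-E-F-G.
graph = {
    "A": ["B"], "B": ["A", "C", "E"], "C": ["B", "D"], "D": ["C"],
    "E": ["B", "F"], "F": ["E", "G"], "G": ["F"],
}
expert_path = ["A", "B", "C", "D"]
collected = [("B", "forward"), ("E", "turn_left"), ("G", "forward")]
trusted = filter_pairs(graph, collected, expert_path, max_dist=1)
```

In this toy example the pair collected at `G` is three hops from the expert route and is discarded, while the pairs at `B` and `E` are kept. Measuring distance along navigable paths rather than in straight lines matters because two states can be close in Euclidean terms yet separated by a wall.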