Vision-and-Language Navigation (VLN) requires agents to follow long-horizon instructions and navigate complex 3D environments. However, existing approaches face two major challenges: constructing an effective long-term memory bank and overcoming compounding errors. To address these issues, we propose DecoVLN, an effective framework designed for robust streaming perception and closed-loop control in long-horizon navigation. First, we formulate long-term memory construction as an optimization problem and introduce an adaptive refinement mechanism that selects frames from a historical candidate pool by iteratively optimizing a unified scoring function. This function jointly balances three key criteria: semantic relevance to the instruction, visual diversity with respect to the already-selected memory, and temporal coverage of the historical trajectory. Second, to alleviate compounding errors, we introduce a corrective finetuning strategy that operates at the level of state-action pairs. By using the geodesic distance between states to quantify deviation from the expert trajectory, the agent collects high-quality state-action pairs within a trusted region while filtering out polluted data with low relevance. This improves both the efficiency and stability of error correction. Extensive experiments demonstrate the effectiveness of DecoVLN, and we have deployed it in real-world environments.
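To make the memory-selection idea concrete, the following is a minimal sketch of how frames might be chosen from a candidate pool by scoring relevance, diversity, and temporal coverage together. The abstract does not specify the scoring function or the optimizer, so everything here is an assumption for illustration: the additive weighted score, the greedy selection loop, the cosine-similarity measures, and the names `select_memory`, `w_rel`, `w_div`, and `w_cov` are hypothetical and not DecoVLN's actual formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def select_memory(frames, instruction, budget, w_rel=1.0, w_div=1.0, w_cov=1.0):
    """Greedily pick up to `budget` frame indices from a candidate pool.

    frames: list of frame embedding vectors, ordered by time.
    instruction: embedding vector of the instruction.
    The additive score and its weights are illustrative assumptions.
    """
    n = len(frames)
    span = max(n - 1, 1)  # normalizer for temporal gaps
    selected = []
    while len(selected) < min(budget, n):
        best_idx, best_score = None, -math.inf
        for i in range(n):
            if i in selected:
                continue
            # Relevance: similarity between the frame and the instruction.
            relevance = cosine(frames[i], instruction)
            if selected:
                # Diversity: dissimilarity to the closest already-selected frame.
                diversity = 1.0 - max(cosine(frames[i], frames[j]) for j in selected)
                # Coverage: normalized temporal gap to the nearest selected frame.
                coverage = min(abs(i - j) for j in selected) / span
            else:
                diversity, coverage = 1.0, 1.0
            score = w_rel * relevance + w_div * diversity + w_cov * coverage
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return sorted(selected)


# Toy example with 2-D embeddings; frames 0 and 1 are visually identical.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
instruction = [1.0, 0.0]
memory = select_memory(frames, instruction, budget=2)
```

In this toy pool, the duplicate frame 1 is skipped in favor of frame 3, which is nearly as relevant but lies further away in time. A greedy loop is only one way to optimize such a score; the paper's "adaptive refinement" may work differently.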
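The geodesic-distance filtering used in the corrective finetuning strategy can be illustrated with a small sketch. It assumes a 2-D occupancy grid, a breadth-first search as the geodesic (shortest-path) distance, and a fixed radius as the "trusted region"; none of these details come from the abstract, and the names `geodesic_distances`, `filter_pairs`, and `radius` are hypothetical.

```python
from collections import deque


def geodesic_distances(grid, sources):
    """Multi-source BFS over free cells (0 = free, 1 = obstacle).

    Returns a dict mapping each reachable cell to its shortest-path
    distance, in steps, from the nearest source cell.
    Source cells are assumed to be free cells inside the grid.
    """
    rows, cols = len(grid), len(grid[0])
    dist = {s: 0 for s in sources}
    queue = deque(dist)
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist


def filter_pairs(grid, expert_path, pairs, radius):
    """Keep (state, action) pairs whose state is within `radius` steps of the expert path."""
    dist = geodesic_distances(grid, expert_path)
    return [(s, a) for s, a in pairs if dist.get(s, float("inf")) <= radius]


# A wall separates the top and bottom rows except at the rightmost column.
grid = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
expert_path = [(0, 0), (0, 1)]
pairs = [((0, 2), "forward"), ((2, 0), "left"), ((0, 3), "right")]
trusted = filter_pairs(grid, expert_path, pairs, radius=2)
```

Cell `(2, 0)` is only two cells from the expert path in a straight line, but seven steps away once the wall is taken into account, so it is filtered out. This is the kind of deviation that a geodesic measure captures and a Euclidean one would miss.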