LLM-based agents have demonstrated impressive zero-shot performance on vision-language navigation (VLN) tasks. However, most zero-shot methods rely primarily on closed-source LLMs as navigators, which incurs high token costs and poses potential data-leakage risks. Recent efforts have attempted to address this by pairing open-source LLMs with a spatiotemporal CoT framework, but they still fall far short of closed-source models. In this work, through a detailed analysis of the navigation process, we identify a critical issue we term Navigation Amnesia. This issue leads to navigation failures and widens the gap between open-source and closed-source methods. To address it, we propose HiMemVLN, which incorporates a Hierarchical Memory System into a multimodal large model to enhance visual-perception recall and long-term localization, mitigating the amnesia issue and improving the agent's navigation performance. Extensive experiments in both simulated and real-world environments demonstrate that HiMemVLN achieves nearly twice the performance of the open-source state-of-the-art method. The code is available at https://github.com/lvkailin0118/HiMemVLN.