Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions and navigate through previously unseen environments. Recent approaches increasingly employ large language models (LLMs) as high-level navigators due to their flexibility and reasoning capability. However, prompt-based LLM navigation often suffers from inefficient decision-making, as the model must repeatedly interpret instructions from scratch and reason over a noisy, verbose set of navigable candidates at each step. In this paper, we propose a retrieval-augmented framework that improves the efficiency and stability of LLM-based VLN without modifying or fine-tuning the underlying language model. Our approach introduces retrieval at two complementary levels. At the episode level, an instruction-level embedding retriever selects semantically similar successful navigation trajectories as in-context exemplars, providing task-specific priors for instruction grounding. At the step level, an imitation-learned candidate retriever prunes irrelevant navigable directions before LLM inference, reducing action ambiguity and prompt complexity. Both retrieval modules are lightweight, modular, and trained independently of the LLM. We evaluate our method on the Room-to-Room (R2R) benchmark. Experimental results demonstrate consistent improvements in Success Rate, Oracle Success Rate, and SPL in both seen and unseen environments. Ablation studies further show that instruction-level exemplar retrieval and candidate pruning contribute complementary benefits, improving global guidance and step-wise decision efficiency respectively. These results indicate that retrieval-augmented decision support is an effective and scalable strategy for enhancing LLM-based vision-and-language navigation.
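The two retrieval levels described above can be sketched as follows. This is a minimal illustration only: the embedding function, memory format, and scoring interface are hypothetical placeholders, and the paper's actual retrievers (and their imitation-learning training) are not reproduced here.

```python
import numpy as np

def cosine_sim(a, b):
    # Standard cosine similarity with a small epsilon for numerical safety.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve_exemplars(instr_emb, memory, k=2):
    """Episode level (sketch): rank a memory of successful trajectories by
    instruction-embedding similarity and return the top-k as in-context
    exemplars for the LLM prompt. `memory` is a list of dicts with
    hypothetical keys "emb" and "trajectory"."""
    ranked = sorted(memory, key=lambda m: -cosine_sim(instr_emb, m["emb"]))
    return [m["trajectory"] for m in ranked[:k]]

def prune_candidates(cand_scores, keep=3):
    """Step level (sketch): keep only the indices of the top-scoring
    navigable directions before LLM inference. The scores are assumed to
    come from a separately trained (imitation-learned) candidate retriever."""
    order = np.argsort(cand_scores)[::-1][:keep]
    return sorted(order.tolist())
```

In both functions the LLM itself is untouched: exemplar retrieval only changes what enters the prompt, and candidate pruning only shrinks the action set the LLM must reason over, which is the modularity the abstract emphasizes.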