Vision-and-Language Navigation (VLN) requires an embodied agent to navigate a complex 3D environment according to natural language instructions. Recent progress in large language models (LLMs) has enabled language-driven navigation with improved interpretability. However, most LLM-based agents still rely on single-shot action decisions, where the model must choose one option from noisy, textualized multi-perspective observations. Due to local mismatches and imperfect intermediate reasoning, such decisions can easily deviate from the correct path, leading to error accumulation and reduced reliability in unseen environments. In this paper, we propose DV-VLN, a new VLN framework that follows a generate-then-verify paradigm. DV-VLN first performs parameter-efficient in-domain adaptation of an open-source LLaMA-2 backbone to produce a structured navigational chain-of-thought, and then verifies candidate actions with two complementary channels: True-False Verification (TFV) and Masked-Entity Verification (MEV). DV-VLN selects actions by aggregating verification successes across multiple samples, yielding interpretable scores for reranking. Experiments on R2R, RxR (English subset), and REVERIE show that DV-VLN consistently improves over direct prediction and sampling-only baselines, achieving competitive performance among language-only VLN agents and promising results compared with several cross-modal systems. Code is available at https://github.com/PlumJun/DV-VLN.
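The verification-based action selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`rerank_actions`), the sample format, and the `tfv`/`mev` callables are all hypothetical stand-ins for the paper's two verification channels and its aggregation of verification successes across sampled chains-of-thought.

```python
from collections import Counter

def rerank_actions(candidates, samples, tfv, mev):
    """Score each candidate action by counting verification successes
    across sampled chains-of-thought (hypothetical interface).

    candidates: list of candidate action ids
    samples:    list of (chain_of_thought, proposed_action) pairs
                obtained by sampling the adapted LLM multiple times
    tfv, mev:   callables returning True if the chain passes
                True-False / Masked-Entity verification, respectively
    """
    scores = Counter({a: 0 for a in candidates})
    for chain, action in samples:
        # A sample only counts toward an action if both channels agree.
        if action in scores and tfv(chain) and mev(chain):
            scores[action] += 1
    # The action with the most verified samples wins; the per-action
    # counts double as interpretable reranking scores.
    best = max(candidates, key=lambda a: scores[a])
    return best, dict(scores)
```

Requiring both channels to pass is one plausible aggregation rule; a weighted combination of the two channels would fit the same interface.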