Recent research on Vision-and-Language Navigation (VLN) indicates that agents generalize poorly to unseen environments due to the lack of realistic training environments and high-quality path-instruction pairs. Most existing methods for constructing realistic navigation scenes incur high costs, and instruction augmentation mainly relies on predefined templates or rules, lacking adaptability. To alleviate these issues, we propose InstruGen, a paradigm for generating VLN path-instruction pairs. Specifically, we use YouTube house tour videos as realistic navigation scenes and leverage the powerful visual understanding and generation abilities of large multimodal models (LMMs) to automatically generate diverse, high-quality VLN path-instruction pairs. Our method generates navigation instructions at different granularities and achieves fine-grained alignment between instructions and visual observations, which previous methods struggled to achieve. Additionally, we design a multi-stage verification mechanism to reduce the hallucinations and inconsistency of LMMs. Experimental results demonstrate that agents trained with path-instruction pairs generated by InstruGen achieve state-of-the-art performance on the R2R and RxR benchmarks, particularly in unseen environments. Code is available at https://github.com/yanyu0526/InstruGen.