Current Vision-and-Language Navigation (VLN) tasks mainly employ textual instructions to guide agents. However, because text is inherently abstract, the same instruction can correspond to different visual signals, causing severe ambiguity and limiting the transfer of the user's visual prior knowledge to the agent. To fill this gap, we propose Vision-and-Language Navigation with Multi-modal Prompts (VLN-MP), a novel task that augments traditional VLN by integrating both natural language and images into instructions. VLN-MP not only maintains backward compatibility by handling text-only prompts effectively but also consistently shows advantages under different quantities and relevance levels of visual prompts. Visual prompts can take the form of either exact or similar object images, providing adaptability and versatility across diverse navigation scenarios. To evaluate VLN-MP under a unified framework, we implement a new benchmark that offers: (1) a training-free pipeline to transform textual instructions into multi-modal forms with landmark images; (2) diverse datasets with multi-modal instructions for different downstream tasks; (3) a novel module designed to process various image prompts for seamless integration with state-of-the-art VLN models. Extensive experiments on four VLN benchmarks (R2R, RxR, REVERIE, and CVDN) show that incorporating visual prompts significantly boosts navigation performance. While remaining efficient with text-only prompts, VLN-MP also enables agents to navigate in the pre-explore setting and outperform text-based models, demonstrating its broader applicability.