Vision-Language Navigation (VLN) requires the agent to follow language instructions to reach a target position. A key factor for successful navigation is aligning the landmarks implied in the instruction with diverse visual observations. However, previous VLN agents fail to perform accurate modality alignment, especially in unexplored scenes, since they learn from limited navigation data and lack sufficient open-world alignment knowledge. In this work, we propose a new VLN paradigm called COrrectable LaNdmark DiScOvery via Large ModEls (CONSOLE). In CONSOLE, we cast VLN as an open-world sequential landmark discovery problem by introducing a novel correctable landmark discovery scheme based on two large models, ChatGPT and CLIP. Specifically, we use ChatGPT to provide rich open-world landmark cooccurrence commonsense, and conduct CLIP-driven landmark discovery based on these commonsense priors. To mitigate the noise in these priors caused by the lack of visual constraints, we introduce a learnable cooccurrence scoring module, which corrects the importance of each cooccurrence according to actual observations for accurate landmark discovery. We further design an observation enhancement strategy for elegantly combining our framework with different VLN agents, where we utilize the corrected landmark features to obtain enhanced observation features for action decision. Extensive experimental results on multiple popular VLN benchmarks (R2R, REVERIE, R4R, RxR) show the significant superiority of CONSOLE over strong baselines. In particular, CONSOLE establishes new state-of-the-art results on R2R and R4R in unseen scenarios. Code is available at https://github.com/expectorlin/CONSOLE.
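The corrected landmark discovery described above can be sketched numerically: a candidate observation is scored against a landmark phrase directly with CLIP similarity, and the cooccurrence priors contribute additional evidence whose per-prior importance weights stand in for the learnable scoring module. The sketch below is a minimal illustration with numpy, assuming CLIP image/text embeddings are already precomputed; the function name `landmark_score`, the fixed weight vector, and the additive score combination are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity between two sets of embeddings."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def landmark_score(obs_feat, landmark_feat, cooc_feats, cooc_weights):
    """Score one candidate observation against a landmark and its priors.

    obs_feat:      (d,)  CLIP image embedding of the observation (assumed precomputed)
    landmark_feat: (d,)  CLIP text embedding of the landmark phrase
    cooc_feats:    (k,d) CLIP text embeddings of k cooccurrence phrases (e.g. from ChatGPT)
    cooc_weights:  (k,)  importance logits; a learnable module would produce these
                         from actual observations -- here they are a fixed placeholder
    """
    # Direct landmark-observation alignment.
    direct = cosine_sim(obs_feat[None], landmark_feat[None])[0, 0]
    # Alignment of the observation with each cooccurrence prior.
    cooc = cosine_sim(obs_feat[None], cooc_feats)[0]        # shape (k,)
    # Softmax-normalize the importance logits, mimicking a corrected weighting.
    w = np.exp(cooc_weights) / np.exp(cooc_weights).sum()
    # Combine direct and prior-driven evidence into one landmark score.
    return float(direct + w @ cooc)
```

In a navigation step, each candidate view would be scored this way for the instruction's current landmark, and the highest-scoring view guides the action decision; the paper's learnable module additionally conditions the weights on the observation rather than keeping them fixed.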