Vision-Language Navigation (VLN) requires the agent to follow language instructions to reach a target position. A key factor for successful navigation is aligning the landmarks implied in the instruction with diverse visual observations. However, previous VLN agents fail to perform accurate modality alignment, especially in unexplored scenes, since they learn from limited navigation data and lack sufficient open-world alignment knowledge. In this work, we propose a new VLN paradigm called COrrectable LaNdmark DiScOvery via Large ModEls (CONSOLE). In CONSOLE, we cast VLN as an open-world sequential landmark discovery problem by introducing a novel correctable landmark discovery scheme based on two large models, ChatGPT and CLIP. Specifically, we use ChatGPT to provide rich open-world landmark cooccurrence commonsense, and conduct CLIP-driven landmark discovery based on these commonsense priors. To mitigate the noise in these priors caused by the lack of visual constraints, we introduce a learnable cooccurrence scoring module, which corrects the importance of each cooccurrence according to actual observations for accurate landmark discovery. We further design an observation enhancement strategy for elegantly combining our framework with different VLN agents, where we utilize the corrected landmark features to obtain enhanced observation features for action decision. Extensive experimental results on multiple popular VLN benchmarks (R2R, REVERIE, R4R, RxR) show the significant superiority of CONSOLE over strong baselines. In particular, CONSOLE establishes new state-of-the-art results on R2R and R4R in unseen scenarios. Code is available at https://github.com/expectorlin/CONSOLE.
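The corrected landmark discovery described above can be sketched numerically: a candidate observation is scored against a landmark phrase directly with CLIP similarity, and the cooccurrence priors contribute additional evidence whose per-prior importance weights stand in for the learnable scoring module. The sketch below is a minimal illustration with numpy, assuming CLIP image/text embeddings are already precomputed; the function name `landmark_score`, the fixed weight vector, and the additive score combination are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity between two sets of embeddings."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def landmark_score(obs_feat, landmark_feat, cooc_feats, cooc_weights):
    """Score one candidate observation against a landmark and its priors.

    obs_feat:      (d,)  CLIP image embedding of the observation (assumed precomputed)
    landmark_feat: (d,)  CLIP text embedding of the landmark phrase
    cooc_feats:    (k,d) CLIP text embeddings of k cooccurrence phrases (e.g. from ChatGPT)
    cooc_weights:  (k,)  importance logits; a learnable module would produce these
                         from actual observations -- here they are a fixed placeholder
    """
    # Direct landmark-observation alignment.
    direct = cosine_sim(obs_feat[None], landmark_feat[None])[0, 0]
    # Alignment of the observation with each cooccurrence prior.
    cooc = cosine_sim(obs_feat[None], cooc_feats)[0]        # shape (k,)
    # Softmax-normalize the importance logits, mimicking a corrected weighting.
    w = np.exp(cooc_weights) / np.exp(cooc_weights).sum()
    # Combine direct and prior-driven evidence into one landmark score.
    return float(direct + w @ cooc)
```

In a navigation step, each candidate view would be scored this way for the instruction's current landmark, and the highest-scoring view guides the action decision; the paper's learnable module additionally conditions the weights on the observation rather than keeping them fixed.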