We explore the use of language as a perceptual representation for vision-and-language navigation. Our approach uses off-the-shelf vision systems (for image captioning and object detection) to convert an agent's egocentric panoramic view at each time step into natural language descriptions. We then finetune a pretrained language model to select the action that, given the current view and the trajectory history, would best fulfill the navigation instructions. In contrast to the standard setup, which adapts a pretrained language model to work directly with continuous visual features from pretrained vision models, our approach uses (discrete) language as the perceptual representation. We explore two use cases of our language-based navigation (LangNav) approach on the R2R vision-and-language navigation benchmark: generating synthetic trajectories from a prompted large language model (GPT-4) with which to finetune a smaller language model; and sim-to-real transfer, where we transfer a policy learned in a simulated environment (ALFRED) to a real-world environment (R2R). In settings where only a few gold trajectories (10-100) are available, our approach improves upon strong baselines that rely on visual features, demonstrating the potential of language as a perceptual representation for navigation tasks.
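To make the pipeline concrete, the following is a minimal sketch of how per-direction captions and detected objects might be assembled into a text prompt for the language model. It is illustrative only: the `ViewDescription` class, the `describe_view` and `build_prompt` helpers, and the prompt template are hypothetical and are not taken from this work, and the captions and object lists are assumed to come from separate off-the-shelf captioning and detection models that are not shown.

```python
from dataclasses import dataclass, field


@dataclass
class ViewDescription:
    """Hypothetical container for one direction of the panoramic view."""
    heading: str                      # e.g. "left", "front", "right"
    caption: str                      # output of an image captioning model
    objects: list[str] = field(default_factory=list)  # detected object labels


def describe_view(view: ViewDescription) -> str:
    """Render one view direction as a single natural-language line."""
    objs = ", ".join(view.objects) if view.objects else "none"
    return f"To your {view.heading}: {view.caption} Objects: {objs}."


def build_prompt(instruction: str,
                 history: list[str],
                 current_views: list[ViewDescription]) -> str:
    """Assemble instruction, trajectory history, and current view as text.

    The template is an assumption made for illustration; each candidate
    view is indexed so that the model can answer with an action index.
    """
    lines = [f"Instruction: {instruction}", "History:"]
    if history:
        for step, text in enumerate(history, start=1):
            lines.append(f"  Step {step}: {text}")
    else:
        lines.append("  (none)")
    lines.append("Current view:")
    for idx, view in enumerate(current_views):
        lines.append(f"  [{idx}] {describe_view(view)}")
    lines.append("Select the index of the next action:")
    return "\n".join(lines)


if __name__ == "__main__":
    views = [
        ViewDescription("left", "A hallway with a red carpet.", ["door", "lamp"]),
        ViewDescription("front", "A kitchen with a wooden table."),
    ]
    print(build_prompt("Walk to the kitchen.", [], views))
```

In such a setup, the finetuned language model would receive this string as input and be trained to emit the index of the chosen view; how actions are actually encoded in LangNav is not specified in the abstract.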