We explore the use of language as a perceptual representation for vision-and-language navigation (VLN), with a focus on low-data settings. Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions. We then finetune a pretrained language model to select the action that, given the current view and the trajectory history, would best fulfill the navigation instructions. In contrast to the standard setup, which adapts a pretrained language model to work directly with continuous visual features from pretrained vision models, our approach uses (discrete) language as the perceptual representation. We explore several use cases of our language-based navigation (LangNav) approach on the R2R VLN benchmark: generating synthetic trajectories from a prompted language model (GPT-4) with which to finetune a smaller language model; domain transfer, where we transfer a policy learned in one simulated environment (ALFRED) to another, more realistic environment (R2R); and combining vision- and language-based representations for VLN. We find that our approach improves upon baselines that rely on visual features in settings where only a few expert trajectories (10-100) are available, demonstrating the potential of language as a perceptual representation for navigation.
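To make the language-as-perception idea concrete, the following is a minimal sketch of how per-heading captions and detected objects might be assembled into a text prompt for the language model. It is an illustration under stated assumptions, not the implementation used in this work: the `ViewDescription` structure, the function names, and the prompt template are all hypothetical, and the captioner, object detector, and language model themselves are left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ViewDescription:
    """Text description of one heading of the panorama (hypothetical structure)."""
    heading: str                                       # e.g. "front", "left"
    caption: str                                       # assumed output of an image captioner
    objects: List[str] = field(default_factory=list)   # assumed output of an object detector


def describe_panorama(views: List[ViewDescription]) -> str:
    """Join the per-heading captions and object lists into one text observation."""
    lines = []
    for view in views:
        line = f"{view.heading.capitalize()}: {view.caption}"
        if view.objects:
            line += " Objects: " + ", ".join(view.objects) + "."
        lines.append(line)
    return "\n".join(lines)


def build_prompt(
    instruction: str,
    history: List[str],
    views: List[ViewDescription],
    candidate_actions: List[str],
) -> str:
    """Build a text prompt from the instruction, trajectory history, and current view.

    The template is an assumption made for illustration; a finetuned language
    model would be expected to continue the prompt with the chosen action index.
    """
    parts = [f"Instruction: {instruction}"]
    if history:
        parts.append("History:")
        parts.extend(f"Step {i + 1}: {step}" for i, step in enumerate(history))
    parts.append("Current view:")
    parts.append(describe_panorama(views))
    parts.append("Candidate actions:")
    parts.extend(f"{i}: {action}" for i, action in enumerate(candidate_actions))
    parts.append("Action:")
    return "\n".join(parts)


if __name__ == "__main__":
    example_views = [
        ViewDescription("front", "A hallway with a wooden door.", ["door", "rug"]),
        ViewDescription("left", "A kitchen counter."),
    ]
    print(build_prompt(
        "Walk to the kitchen.",
        ["Moved forward past a sofa."],
        example_views,
        ["go forward", "turn left"],
    ))
```

Because the observation is plain text, the same prompt format could in principle be filled from real captions, from synthetic trajectories written by a prompted language model, or from a different simulator, which is what makes the data-generation and transfer settings described above possible.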