Incremental decision making in real-world environments is one of the most challenging tasks in embodied artificial intelligence. One particularly demanding scenario is Vision and Language Navigation (VLN), which requires visual and natural language understanding as well as spatial and temporal reasoning capabilities. The embodied agent needs to ground its understanding of navigation instructions in observations of a real-world environment such as Street View. Despite the impressive results of large language models (LLMs) in other research areas, how best to connect them with an interactive visual environment remains an open problem. In this work, we propose VELMA, an embodied LLM agent that uses a verbalization of the trajectory and of visual environment observations as the contextual prompt for the next action. Visual information is verbalized by a pipeline that extracts landmarks from the human-written navigation instructions and uses CLIP to determine their visibility in the current panorama view. We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples. We further finetune the LLM agent on a few thousand examples and achieve a 25%-30% relative improvement in task completion over the previous state of the art on two datasets.
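The verbalization step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the direction phrases, the threshold, and the prompt layout are all hypothetical, and the image-text scorer is passed in as a plain callable standing in for CLIP rather than calling any real CLIP API.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical direction phrases, one per slice of the panorama.
# The abstract does not specify how the panorama is divided.
DIRECTIONS = (
    "on your left",
    "slightly to your left",
    "ahead",
    "slightly to your right",
    "on your right",
)


def visible_landmarks(
    landmarks: Sequence[str],
    view_scores: Callable[[str], Sequence[float]],
    threshold: float,
) -> Dict[str, str]:
    """Map each landmark judged visible to the direction of its best view.

    view_scores(landmark) is a stand-in for an image-text scorer such as
    CLIP: it must return one similarity score per entry in DIRECTIONS.
    """
    seen: Dict[str, str] = {}
    for landmark in landmarks:
        scores = list(view_scores(landmark))
        if len(scores) != len(DIRECTIONS):
            raise ValueError("expected one score per direction")
        best = max(range(len(scores)), key=scores.__getitem__)
        # Assumed rule: a landmark counts as visible only if its best
        # score reaches the threshold.
        if scores[best] >= threshold:
            seen[landmark] = DIRECTIONS[best]
    return seen


def verbalize_observation(seen: Dict[str, str]) -> str:
    """Turn the visible landmarks into one short text observation."""
    return " ".join(f"There is a {name} {where}." for name, where in seen.items())


def build_prompt(instruction: str, history: List[str], observation: str) -> str:
    """Assemble the text prompt: instruction, numbered past steps, and an
    open numbered slot that the LLM completes with the next action."""
    steps = list(history)
    if observation:
        steps.append(observation)
    lines = ["Navigation instructions:", instruction, "Action sequence:"]
    lines += [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    lines.append(f"{len(steps) + 1}.")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up scores in place of real CLIP similarities.
    fake = {"Starbucks": [0.1, 0.2, 0.1, 0.9, 0.3], "bank": [0.1, 0.1, 0.2, 0.1, 0.1]}
    seen = visible_landmarks(["Starbucks", "bank"], lambda name: fake[name], 0.5)
    print(build_prompt(
        "Go past the Starbucks and turn left at the bank.",
        ["forward", "forward"],
        verbalize_observation(seen),
    ))
```

Passing the scorer as a callable keeps the sketch runnable without model weights; in practice it would wrap a CLIP model that compares a landmark phrase against image crops of the current panorama.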