Incremental decision making in real-world environments is one of the most challenging tasks in embodied artificial intelligence. One particularly demanding scenario is Vision and Language Navigation (VLN), which requires visual and natural language understanding as well as spatial and temporal reasoning capabilities. The embodied agent needs to ground its understanding of navigation instructions in observations of a real-world environment such as Street View. Despite the impressive results of LLMs in other research areas, how best to connect them with an interactive visual environment remains an open problem. In this work, we propose VELMA, an embodied LLM agent that uses a verbalization of the trajectory and of visual environment observations as a contextual prompt for the next action. Visual information is verbalized by a pipeline that extracts landmarks from the human-written navigation instructions and uses CLIP to determine their visibility in the current panorama view. We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples. We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state of the art on two datasets.
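
To make the verbalization idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical `scorer` callable that stands in for CLIP and returns a similarity score for a landmark in a given view direction. The direction names, the threshold value, and the prompt layout are all illustrative assumptions.

```python
from typing import Callable, Dict, List

# Hypothetical scorer: (landmark, view direction) -> similarity score.
# The abstract says CLIP fills this role; no real CLIP call is made here.
Scorer = Callable[[str, str], float]

# Assumed set of view directions; the abstract does not specify them.
DIRECTIONS = ["left", "ahead", "right"]


def visible_landmarks(landmarks: List[str], scorer: Scorer,
                      threshold: float) -> Dict[str, str]:
    """Map each landmark to its best-scoring direction if it passes the threshold."""
    seen: Dict[str, str] = {}
    for landmark in landmarks:
        scores = {d: scorer(landmark, d) for d in DIRECTIONS}
        best = max(scores, key=scores.get)
        if scores[best] >= threshold:
            seen[landmark] = best
    return seen


def verbalize_observation(seen: Dict[str, str]) -> str:
    """Turn the visibility result into one sentence of text."""
    if not seen:
        return "No landmarks are visible."
    items = ", ".join(f"{lm} ({d})" for lm, d in seen.items())
    return f"Visible landmarks: {items}."


def build_prompt(instructions: str, history: List[str], observation: str) -> str:
    """Assemble instructions, past steps, and the current observation as text."""
    lines = ["Navigation instructions:", instructions, "Trajectory so far:"]
    for i, entry in enumerate(history, start=1):
        lines.append(f"{i}. {entry}")
    lines.append(f"{len(history) + 1}. {observation}")
    lines.append("Next action:")  # the LLM would be asked to continue from here
    return "\n".join(lines)


def fake_scorer(landmark: str, direction: str) -> float:
    """Toy stand-in for CLIP scores, used only for this demonstration."""
    table = {("a bank", "left"): 0.9}
    return table.get((landmark, direction), 0.1)


seen = visible_landmarks(["a bank", "a pharmacy"], fake_scorer, threshold=0.5)
prompt = build_prompt(
    "Go straight and turn left at the bank.",
    ["forward"],
    verbalize_observation(seen),
)
print(prompt)
```

Keeping the scorer behind a plain callable means the text-building logic can be exercised without loading a vision model. A real system would replace `fake_scorer` with image-text similarity computed on the current panorama.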