Humans can quickly learn new behaviors by leveraging background world knowledge. In contrast, agents trained with reinforcement learning (RL) typically learn behaviors from scratch. We thus propose a novel approach that uses the vast amounts of general and indexable world knowledge encoded in vision-language models (VLMs) pre-trained on Internet-scale data for embodied RL. We initialize policies with VLMs by using them as promptable representations: embeddings that are grounded in visual observations and encode semantic features based on the VLM's internal knowledge, as elicited through prompts that provide task context and auxiliary information. We evaluate our approach on visually complex, long-horizon RL tasks in Minecraft and on robot navigation in Habitat. We find that policies trained on embeddings extracted from general-purpose VLMs outperform equivalent policies trained on generic, non-promptable image embeddings. We also find that our approach outperforms instruction-following methods and performs comparably to domain-specific embeddings.
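The idea of a promptable representation can be illustrated with a toy sketch. This is not the authors' implementation: `toy_vlm_embed` is a hypothetical stand-in for a frozen VLM (here just a prompt-seeded random projection of the pixels), and `LinearPolicy`, `act`, the dimensions, and the prompt string are likewise assumptions made only for illustration. What it shows is the data flow described above: the same observation yields different embeddings under different prompts, and a small policy head is trained on those embeddings rather than on raw pixels.

```python
import hashlib

import numpy as np

EMBED_DIM = 32    # assumed embedding size (illustrative only)
NUM_ACTIONS = 4   # assumed discrete action count (illustrative only)


def toy_vlm_embed(image: np.ndarray, prompt: str) -> np.ndarray:
    """Hypothetical stand-in for a frozen VLM.

    A real VLM would consume the image and the prompt and expose internal
    features; here the prompt only seeds a fixed random projection, so the
    embedding depends on both the observation and the prompt.
    """
    # Derive a deterministic seed from the prompt text.
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:4], "little")
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((EMBED_DIM, image.size)) / np.sqrt(image.size)
    return np.tanh(proj @ image.ravel())


class LinearPolicy:
    """Small trainable head that maps an embedding to action probabilities."""

    def __init__(self, embed_dim: int, num_actions: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((num_actions, embed_dim))
        self.b = np.zeros(num_actions)

    def action_probs(self, embedding: np.ndarray) -> np.ndarray:
        logits = self.W @ embedding + self.b
        logits = logits - logits.max()  # subtract max for numerical stability
        exp = np.exp(logits)
        return exp / exp.sum()


def act(policy: LinearPolicy, image: np.ndarray, prompt: str,
        rng: np.random.Generator):
    """Embed the observation under a task prompt, then sample an action."""
    z = toy_vlm_embed(image, prompt)
    probs = policy.action_probs(z)
    action = int(rng.choice(len(probs), p=probs))
    return action, probs


if __name__ == "__main__":
    obs = np.random.default_rng(0).random((8, 8, 3))  # fake RGB observation
    policy = LinearPolicy(EMBED_DIM, NUM_ACTIONS)
    # Hypothetical task prompt; not taken from the paper.
    action, probs = act(policy, obs, "Is the target object visible?",
                        np.random.default_rng(1))
    print(action, probs.round(3))
```

In an actual RL setup the VLM would stay frozen and only the policy head would be updated by the RL algorithm; the choice of prompt then becomes a design decision that determines which task-relevant features the embedding exposes.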