Instruction-following agents must ground language into their observation and action spaces. Learning to ground language is challenging, typically requiring domain-specific engineering or large quantities of human interaction data. To address this challenge, we propose using pretrained vision-language models (VLMs) to supervise embodied agents. We combine ideas from model distillation and hindsight experience replay (HER), using a VLM to retroactively generate language describing the agent's behavior. Simple prompting allows us to control the supervision signal, teaching an agent to interact with novel objects based on their names (e.g., planes) or their features (e.g., colors) in a 3D rendered environment. Few-shot prompting lets us teach abstract category membership, including pre-existing categories (food vs. toys) and ad-hoc ones (arbitrary preferences over objects). Our work outlines a new and effective way to use internet-scale VLMs, repurposing the generic language grounding acquired by such models to teach task-relevant groundings to embodied agents.
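The hindsight relabeling idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Trajectory` container, the `VLM` callable signature, the prompt format, and the use of strings as stand-ins for image frames are all assumptions made for illustration. The sketch only shows the general pattern of asking a VLM to describe what the agent ended up doing and storing that description as the instruction for the same trajectory.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Trajectory:
    instruction: str          # instruction paired with this behavior
    observations: List[str]   # stand-in for image frames (assumption)
    actions: List[int]


# Hypothetical VLM interface: (prompt, final observation) -> text answer.
VLM = Callable[[str, str], str]


def build_prompt(question: str,
                 few_shot: Sequence[Tuple[str, str]] = ()) -> str:
    """Assemble a text prompt; few-shot (item, label) pairs are optional."""
    lines = [f"Example: {item} -> {label}" for item, label in few_shot]
    lines.append(question)
    return "\n".join(lines)


def relabel_with_vlm(trajectories: Sequence[Trajectory],
                     vlm: VLM,
                     prompt: str) -> List[Trajectory]:
    """Replace each instruction with the VLM's description of the outcome."""
    relabeled = []
    for traj in trajectories:
        # Query the VLM about the final frame, in hindsight.
        description = vlm(prompt, traj.observations[-1]).strip()
        relabeled.append(
            Trajectory(description, traj.observations, traj.actions)
        )
    return relabeled
```

Under these assumptions, changing `prompt` (for example, asking for the object's name versus its color) or supplying `few_shot` examples changes which labels the relabeled trajectories carry. The relabeled pairs could then serve as supervised training data for an instruction-conditioned policy.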