Our goal is for robots to follow natural language instructions like "put the towel next to the microwave." However, collecting large amounts of labeled data, i.e., demonstrations of tasks annotated with the corresponding language instruction, is prohibitively expensive. In contrast, obtaining policies that respond to image goals is much easier, because any autonomous trial or demonstration can be labeled in hindsight with its final state as the goal. In this work, we contribute a method that taps into joint goal-image- and language-conditioned policies using only a small amount of language data. Prior work has made progress on this problem using vision-language models or by jointly training language-goal-conditioned policies, but so far neither approach has scaled effectively to real-world robot tasks without significant human annotation. Our method achieves robust performance in the real world by learning, from the labeled data, an embedding that aligns language not to the goal image but to the desired change between the start and goal images that the instruction corresponds to. We then train a policy on this embedding: the policy benefits from all the unlabeled data, while the aligned embedding provides an interface for language to steer the policy. We show instruction following across a variety of manipulation tasks in different scenes, with generalization to language instructions outside of the labeled data. Videos and code for our approach can be found on our website: http://tiny.cc/grif
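To make the alignment idea concrete, the following is a minimal sketch of one way to align instruction embeddings with embeddings of the start-to-goal change. The abstract does not specify the training objective, so the symmetric contrastive (InfoNCE-style) loss, the temperature value, the linear `encode_change` encoder, and all function names here are illustrative assumptions rather than the method's actual implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def encode_change(start_feat, goal_feat, W):
    """Hypothetical encoder for the change between start and goal images.

    Assumption: image features are precomputed vectors, and a single linear
    map W over their concatenation stands in for a learned network.
    """
    return np.concatenate([start_feat, goal_feat], axis=-1) @ W


def alignment_loss(lang_emb, change_emb, temperature=0.1):
    """Symmetric contrastive loss (an assumed objective, not from the source).

    Row i of lang_emb and row i of change_emb come from the same labeled
    trajectory; every other row in the batch acts as a negative.
    """
    lang = l2_normalize(lang_emb)
    change = l2_normalize(change_emb)
    logits = lang @ change.T / temperature  # shape (batch, batch)
    idx = np.arange(logits.shape[0])
    # Language -> change: each instruction should pick out its own transition.
    loss_l2c = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Change -> language: each transition should pick out its own instruction.
    loss_c2l = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_l2c + loss_c2l)
```

Under these assumptions, a policy conditioned on the output of `encode_change` could be trained on unlabeled trajectories with hindsight goals, and at test time the instruction embedding could be supplied in its place, since the loss pulls the two into a shared space.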