Our goal is for robots to follow natural language instructions like "put the towel next to the microwave." But collecting large amounts of labeled data, i.e., task demonstrations annotated with the corresponding language instruction, is prohibitively expensive. In contrast, obtaining policies that respond to image goals is much easier, because any autonomous trial or demonstration can be labeled in hindsight with its final state as the goal. In this work, we contribute a method that taps into joint image- and goal-conditioned policies with language using only a small amount of language data. Prior work has made progress on this problem using vision-language models or by jointly training language-goal-conditioned policies, but so far neither approach has scaled effectively to real-world robot tasks without significant human annotation. Our method achieves robust performance in the real world by learning an embedding from the labeled data that aligns language not to the goal image, but rather to the desired change between the start and goal images that the instruction corresponds to. We then train a policy on this embedding: the policy benefits from all the unlabeled data, while the aligned embedding provides an interface for language to steer the policy. We show instruction following across a variety of manipulation tasks in different scenes, with generalization to language instructions outside of the labeled data. Videos and code for our approach can be found on our website: https://rail-berkeley.github.io/grif/.
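The two ideas in the abstract, hindsight goal labeling and aligning language with the change between start and goal images, can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration: the abstract does not specify the encoders or the alignment objective, so the sketch uses a toy linear encoder over pre-extracted features and a symmetric InfoNCE-style contrastive loss, which is one common way to align two embedding spaces. The function names (`hindsight_relabel`, `encode_change`, `alignment_loss`) are hypothetical and are not taken from the released code.

```python
import numpy as np


def hindsight_relabel(trajectory):
    """Label an unlabeled trajectory with its own final frame as the goal."""
    return {"start": trajectory[0], "goal": trajectory[-1]}


def encode_change(start_feat, goal_feat, weights):
    """Toy stand-in (assumption) for a learned encoder of the (start, goal) pair."""
    return np.concatenate([start_feat, goal_feat], axis=-1) @ weights


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def alignment_loss(change_emb, lang_emb, temperature=0.1):
    """Symmetric InfoNCE loss (assumed objective).

    Row i of change_emb and row i of lang_emb form a matching pair;
    every other row in the batch serves as a negative.
    """
    z_change = l2_normalize(change_emb)
    z_lang = l2_normalize(lang_emb)
    logits = z_change @ z_lang.T / temperature
    idx = np.arange(logits.shape[0])
    change_to_lang = -log_softmax(logits, axis=1)[idx, idx].mean()
    lang_to_change = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (change_to_lang + lang_to_change)


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Unlabeled data: a trajectory of 5 frames with 4 features each.
    trajectory = rng.normal(size=(5, 4))
    pair = hindsight_relabel(trajectory)

    # Labeled data: a batch of 3 (start, goal, instruction-embedding) triples.
    starts = rng.normal(size=(3, 4))
    goals = rng.normal(size=(3, 4))
    lang_emb = rng.normal(size=(3, 6))
    weights = rng.normal(size=(8, 6))

    change_emb = encode_change(starts, goals, weights)
    print("alignment loss:", alignment_loss(change_emb, lang_emb))
```

In this sketch, only the small labeled batch enters `alignment_loss`, while `hindsight_relabel` shows how any unlabeled trajectory still yields a (start, goal) pair that a policy conditioned on the same embedding could train on.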