Natural language is often the easiest and most convenient modality for humans to specify tasks for robots. However, learning to ground language in behavior typically requires impractical amounts of diverse, language-annotated demonstrations collected on each target robot. In this work, we aim to separate the problem of what to accomplish from how to accomplish it, as the former can benefit from substantial amounts of external observation-only data, while only the latter depends on a specific robot embodiment. To this end, we propose Video-Language Critic, a reward model that can be trained on readily available cross-embodiment data using contrastive learning and a temporal ranking objective, and use it to score behavior traces from a separate reinforcement learning actor. When trained on Open X-Embodiment data, our reward model enables 2x more sample-efficient policy training on Meta-World tasks than a sparse reward alone, despite a significant domain gap. Using in-domain data but in a challenging task-generalization setting on Meta-World, we further demonstrate more sample-efficient training than is possible with prior language-conditioned reward models that are trained with binary classification, use static images, or fail to leverage the temporal information present in video data.
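To make the training objective concrete, below is a minimal PyTorch sketch of one plausible combination of a contrastive video-language matching term with a temporal ranking term. It assumes per-frame video embeddings and text embeddings from hypothetical encoders; the function name `vlc_losses`, the CLIP-style symmetric contrastive term over a last-frame summary embedding, and the adjacent-frame hinge for the ranking term are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def vlc_losses(frame_emb, text_emb, temperature=0.07, margin=0.0):
    """Sketch of a contrastive + temporal ranking objective (assumed form).

    frame_emb: (B, T, D) per-frame video embeddings from a hypothetical video encoder
    text_emb:  (B, D)    task-description embeddings from a hypothetical text encoder
    """
    frame_emb = F.normalize(frame_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Contrastive term: match each video (summarized by its final frame)
    # to its own task description against other captions in the batch.
    video_emb = frame_emb[:, -1]                       # (B, D)
    logits = video_emb @ text_emb.t() / temperature    # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, targets)
                   + F.cross_entropy(logits.t(), targets)) / 2

    # Temporal ranking term: in a successful demonstration, later frames
    # should score higher against the caption than earlier frames.
    scores = torch.einsum('btd,bd->bt', frame_emb, text_emb)  # (B, T)
    earlier, later = scores[:, :-1], scores[:, 1:]
    ranking = F.relu(margin + earlier - later).mean()

    return contrastive, ranking
```

Under this reading, the two terms would be combined with a weighting coefficient during reward-model training, and at policy-training time the per-frame video-text score would serve as a dense reward alongside the environment's sparse task reward.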