We present Language-Image Value learning (LIV), a unified objective for vision-language representation and reward learning from action-free videos with text annotations. Exploiting a novel connection between dual reinforcement learning and mutual information contrastive learning, the LIV objective trains a multi-modal representation that implicitly encodes a universal value function for tasks specified as language or image goals. We use LIV to pre-train the first control-centric vision-language representation from large human video datasets such as EpicKitchen. Given only a language or image goal, the pre-trained LIV model can assign dense rewards to each frame in videos of unseen robots or humans attempting that task in unseen environments. Further, when some target domain-specific data is available, the same objective can be used to fine-tune and improve LIV and even other pre-trained representations for robotic control and reward specification in that domain. In our experiments on several simulated and real-world robot environments, LIV models consistently outperform the best prior input state representations for imitation learning, as well as reward specification methods for policy synthesis. Our results validate the advantages of joint vision-language representation and reward learning within the unified, compact LIV framework.
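To make the dense-reward claim concrete, the following is a minimal sketch of how a goal-conditioned value could be read off a shared embedding space and turned into per-frame rewards. It is an illustration under stated assumptions, not the LIV implementation: the abstract does not say how values are computed, so cosine similarity between a frame embedding and a goal embedding is assumed here, and the reward is assumed to be the change in value between consecutive frames. The function names `frame_values` and `dense_rewards` and the toy 2-D embeddings are hypothetical stand-ins for real encoder outputs.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def frame_values(frame_embs: Sequence[Sequence[float]],
                 goal_emb: Sequence[float]) -> List[float]:
    """Assumed value of each frame: its similarity to the goal embedding."""
    return [cosine_similarity(f, goal_emb) for f in frame_embs]


def dense_rewards(values: Sequence[float]) -> List[float]:
    """Assumed per-step reward: change in value between consecutive frames."""
    return [v_next - v_cur for v_cur, v_next in zip(values, values[1:])]


# Toy embeddings (made up for illustration); in practice these would come
# from an image encoder (frames) and a text or image encoder (goal).
goal = [1.0, 0.0]
frames = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]

values = frame_values(frames, goal)
rewards = dense_rewards(values)
```

In this toy trajectory the frames move steadily toward the goal direction, so the values increase and every per-step reward is positive. A video that drifts away from the goal would produce negative rewards under the same assumptions.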