Observing a human demonstrator manipulate objects provides a rich, scalable and inexpensive source of data for learning robotic policies. However, transferring skills from human videos to a robotic manipulator poses several challenges, not least a difference in action and observation spaces. In this work, we use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies. Thanks to the diversity of this training data, the learned reward function generalizes well enough to image observations from a previously unseen robot embodiment and environment to provide a meaningful prior for directed exploration in reinforcement learning. We propose two methods for scoring states relative to a goal image: through direct temporal regression, and through distances in an embedding space obtained with time-contrastive learning. By conditioning the function on a goal image, we are able to reuse one model across a variety of tasks. Unlike prior work on leveraging human videos to teach robots, our method, Human Offline Learned Distances (HOLD), requires neither a priori data from the robot environment, nor a set of task-specific human demonstrations, nor a predefined notion of correspondence across morphologies, yet it is able to accelerate training of several manipulation tasks on a simulated robot arm compared to using only a sparse reward obtained from task completion.
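As a rough illustration of the embedding-distance variant, the sketch below scores an observation by its negative distance to the goal in embedding space, and shows a generic triplet-style time-contrastive loss of the kind that could shape such a space. It is a minimal sketch under stated assumptions, not the implementation used in this work: the function names, the Euclidean metric, the margin value, and the use of precomputed embedding vectors in place of a learned image encoder are all hypothetical.

```python
import numpy as np


def embedding_distance_reward(obs_emb, goal_emb):
    """Goal-conditioned reward: negative Euclidean distance to the goal embedding.

    Assumption: both inputs are embedding vectors produced by some image
    encoder, which is not modeled here.
    """
    return -float(np.linalg.norm(obs_emb - goal_emb))


def time_contrastive_triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet loss for time-contrastive learning (illustrative).

    `positive` is assumed to come from a frame temporally close to `anchor`,
    `negative` from a frame farther away in the same video. The loss is zero
    once the negative is at least `margin` farther (in squared distance)
    from the anchor than the positive is.
    """
    d_pos = float(np.sum((anchor - positive) ** 2))
    d_neg = float(np.sum((anchor - negative) ** 2))
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D embeddings standing in for encoded frames (hypothetical values).
goal = np.array([1.0, 0.0])
near_goal = np.array([0.9, 0.0])
far_from_goal = np.array([-1.0, 0.0])

r_near = embedding_distance_reward(near_goal, goal)
r_far = embedding_distance_reward(far_from_goal, goal)
```

In this toy setup the state nearer the goal receives the higher reward, which is the property a dense shaping signal needs in order to guide exploration alongside a sparse task-completion reward.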