Teaching robots novel skills from demonstrations via human-in-the-loop data collection techniques like kinesthetic teaching or teleoperation places a heavy burden on human supervisors. In contrast to this paradigm, it is often significantly easier to provide raw, action-free visual data of tasks being performed. Moreover, such data can even be mined from video datasets or the web. Ideally, this data can guide robot learning for new tasks in novel environments, informing both "what" to do and "how" to do it. A powerful way to encode both the "what" and the "how" is to infer a well-shaped reward function for reinforcement learning. The challenge is determining how to ground visual demonstration inputs into a well-shaped and informative reward function. We propose Rank2Reward, a technique for learning behaviors from videos of tasks being performed without access to any low-level states and actions. We do so by leveraging the videos to learn a reward function that measures incremental "progress" through a task by learning to temporally rank the video frames in a demonstration. By inferring an appropriate ranking, the reward function is able to guide reinforcement learning by indicating when task progress is being made. This ranking function can be integrated into an adversarial imitation learning scheme, resulting in an algorithm that can learn behaviors without exploiting the learned reward function. We demonstrate the effectiveness of Rank2Reward at learning behaviors from raw video on a number of tabletop manipulation tasks, both in simulation and on a real-world robotic arm. We also demonstrate how Rank2Reward can be easily extended to web-scale video datasets.
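The core idea of learning a progress reward by temporally ranking demonstration frames can be illustrated with a minimal sketch. The snippet below is an assumption-laden toy version, not the paper's implementation: it uses synthetic feature vectors in place of raw images, a linear utility function in place of a learned image encoder, and a Bradley-Terry pairwise ranking loss trained so that frames later in a demonstration receive higher utility. The trained utility then doubles as a shaped progress reward.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "demonstration": T frames, each summarized by a D-dim feature
# vector whose signal grows with latent task progress. This stands in for
# real video frames passed through an image encoder (an assumption for
# illustration only).
T, D = 20, 5
true_w = rng.normal(size=D)
progress = np.linspace(0.0, 1.0, T)            # latent per-frame progress
frames = np.outer(progress, true_w) + 0.05 * rng.normal(size=(T, D))

w = np.zeros(D)                                # linear utility u(x) = w @ x

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Temporal ranking objective: for random frame pairs (i, j) with i < j,
# maximize P(j ranks after i) = sigmoid(u_j - u_i)  (Bradley-Terry model),
# i.e. minimize -log sigmoid(u_j - u_i) by gradient descent.
lr = 0.5
for _ in range(2000):
    i, j = sorted(rng.choice(T, size=2, replace=False))
    p = sigmoid(frames[j] @ w - frames[i] @ w)
    grad = (p - 1.0) * (frames[j] - frames[i])  # d/dw of -log p
    w -= lr * grad

# The learned utility, evaluated per frame, serves as a dense progress
# reward that increases as the task nears completion.
reward = frames @ w
```

In the full method this ranking-based reward is combined with an adversarial imitation term so that the policy cannot exploit the learned reward outside the demonstrated state distribution; that interaction is beyond the scope of this toy sketch.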