We present an approach to robot learning from egocentric human videos that models human preferences with a reward function and optimizes robot behavior to maximize this reward. Prior work on reward learning from human videos measures the long-term value of a visual state as its temporal distance from the terminal state of a demonstration video. These approaches rest on assumptions that limit performance when learning from video, and they must transfer the learned value function across the embodiment and environment gap between human and robot. Our method models human preferences by learning to predict the motion of tracked points between subsequent images, and defines the reward as the agreement between predicted and observed object motion at each step of the robot's behavior. We then use a modified Soft Actor-Critic (SAC) algorithm, initialized with 10 on-robot demonstrations, to estimate a value function from this reward and optimize a policy that maximizes it, entirely on the robot. Our approach is capable of learning on a real robot, and we show that policies learned with our reward model match or outperform prior work across multiple tasks in both simulation and on the real robot.
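The per-step reward described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the choice of mean Euclidean error over per-point 2D displacements, and the exponential mapping to a bounded reward are all assumptions; the abstract does not specify the exact agreement measure.

```python
import numpy as np

def point_motion_reward(pred_disp, obs_disp, temperature=1.0):
    """Reward as agreement between predicted and observed tracked-point motion.

    pred_disp, obs_disp: (N, 2) arrays of per-point 2D displacements
    between consecutive frames. Returns a scalar in (0, 1], with 1.0
    meaning the observed motion exactly matches the prediction.

    Hypothetical sketch: the agreement measure and exponential mapping
    are assumed, not taken from the paper.
    """
    # Mean Euclidean error between predicted and observed displacements.
    err = np.linalg.norm(pred_disp - obs_disp, axis=-1).mean()
    # Map error to a bounded agreement score: zero error gives reward 1.0,
    # larger disagreement decays toward 0.
    return float(np.exp(-err / temperature))
```

A downstream RL algorithm such as SAC would query this reward once per environment step, comparing the motion the learned model predicts from the current image against the motion actually observed after the robot acts.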