We present PROGRESSOR, a novel framework that learns a task-agnostic reward function from videos, enabling policy training through goal-conditioned reinforcement learning (RL) without manual supervision. Underlying this reward is an estimate of the distribution over task progress as a function of the current, initial, and goal observations, learned in a self-supervised fashion. Crucially, PROGRESSOR refines rewards adversarially during online RL training by pushing back predictions for out-of-distribution observations, mitigating the distribution shift inherent in non-expert observations. Using this progress prediction as a dense reward together with the adversarial push-back, we show that PROGRESSOR enables robots to learn complex behaviors without any external supervision. Pretrained on large-scale egocentric human video from EPIC-KITCHENS, PROGRESSOR requires no fine-tuning on in-domain task-specific data to generalize to real-robot offline RL under noisy demonstrations, outperforming contemporary methods that provide dense visual rewards for robotic learning. Our findings highlight the potential of PROGRESSOR for scalable robotic applications where direct action labels and task-specific rewards are not readily available.
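The progress-as-reward idea can be illustrated with a minimal sketch. This is not the paper's implementation: the quadratic uncertainty penalty, the toy distance-ratio predictor, and all function names below are illustrative assumptions; the actual model is a learned estimator of the progress distribution over visual observations.

```python
import numpy as np

def progress_reward(predict, o_t, o_0, o_g):
    """Dense reward from a predicted progress distribution.

    `predict` returns (mu, var): the mean and variance of estimated
    task progress in [0, 1] given current (o_t), initial (o_0), and
    goal (o_g) observations. Penalizing variance is a hypothetical
    shaping choice, not the paper's exact formulation.
    """
    mu, var = predict(o_t, o_0, o_g)
    return mu - 0.5 * var

def toy_predict(o_t, o_0, o_g):
    """Stand-in predictor: fraction of straight-line distance to the
    goal already covered, with a fixed small variance. A real model
    would be trained self-supervisedly on video frame triplets, with
    out-of-distribution online observations adversarially pushed back
    toward low predicted progress."""
    total = np.linalg.norm(o_g - o_0) + 1e-8
    remaining = np.linalg.norm(o_g - o_t)
    mu = float(np.clip(1.0 - remaining / total, 0.0, 1.0))
    return mu, 0.01

o_0 = np.zeros(3)          # initial observation (embedding stand-in)
o_g = np.ones(3)           # goal observation
o_t = 0.5 * np.ones(3)     # halfway to the goal
r = progress_reward(toy_predict, o_t, o_0, o_g)
```

With the toy predictor, an observation halfway along the straight line to the goal yields a reward of 0.5 minus the small uncertainty penalty; the learned variant replaces the distance ratio with a trained visual progress model.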