Pre-training on Internet data has proven to be a key ingredient for broad generalization in many modern ML systems. What would it take to enable such capabilities in robotic reinforcement learning (RL)? Offline RL methods, which learn from datasets of robot experience, offer one way to incorporate prior data into the robotic learning pipeline. However, these methods have a "type mismatch" with video data (such as Ego4D), the largest prior datasets available for robotics, since video offers observation-only experience without the action or reward annotations needed for RL methods. In this paper, we develop a system for leveraging large-scale human video datasets in robotic offline RL, based entirely on learning value functions via temporal-difference learning. We show that value learning on video datasets yields representations that are more conducive to downstream robotic offline RL than other approaches for learning from video data. Our system, called V-PTR, combines the benefits of pre-training on video data with robotic offline RL approaches that train on diverse robot data, resulting in value functions and policies for manipulation tasks that perform better, act robustly, and generalize broadly. On several manipulation tasks on a real WidowX robot, our framework produces policies that greatly improve over prior methods. Our video and additional details can be found at https://dibyaghosh.com/vptr/
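To make the "type mismatch" concrete, the following toy sketch shows how temporal-difference learning can still be run on observation-only sequences once a reward is supplied by construction. It is not the V-PTR implementation: the tabular setting, the function name `td_value_from_observations`, and the goal-reaching reward (-1 per step until the final frame is reached) are all assumptions made purely for illustration.

```python
import numpy as np


def td_value_from_observations(trajectories, n_states, goal,
                               gamma=0.9, lr=0.5, epochs=200):
    """Tabular TD(0) on observation-only sequences (no actions, no rewards).

    Assumption (illustrative, not taken from the paper): the reward is -1 on
    every transition, and reaching `goal` ends the episode with value 0.
    """
    V = np.zeros(n_states)
    for _ in range(epochs):
        for traj in trajectories:
            # Consecutive frames (s, s_next) are the only supervision used.
            for s, s_next in zip(traj[:-1], traj[1:]):
                done = (s_next == goal)
                reward = -1.0  # hypothetical self-supplied reward
                bootstrap = 0.0 if done else gamma * V[s_next]
                V[s] += lr * (reward + bootstrap - V[s])
    return V


# A single "video": a sequence of discrete observation ids ending at the goal.
video = [[0, 1, 2, 3, 4]]
V = td_value_from_observations(video, n_states=5, goal=4)
print(V)
```

In this sketch, states closer to the final frame receive higher values, so the learned value function encodes temporal progress toward the goal without any action labels. A real system would replace the table with a neural network over images.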