VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

Reward and representation learning are two long-standing challenges for learning an expanding set of robot manipulation skills from sensory observations. Given the inherent cost and scarcity of in-domain, task-specific robot data, learning from large, diverse, offline human videos has emerged as a promising path towards acquiring a generally useful visual representation for control; however, how these human videos can be used for general-purpose reward learning remains an open question. We introduce $\textbf{V}$alue-$\textbf{I}$mplicit $\textbf{P}$re-training (VIP), a self-supervised pre-trained visual representation capable of generating dense and smooth reward functions for unseen robotic tasks. VIP casts representation learning from human videos as an offline goal-conditioned reinforcement learning problem and derives a self-supervised dual goal-conditioned value-function objective that does not depend on actions, enabling pre-training on unlabeled human videos. Theoretically, VIP can be understood as a novel implicit time contrastive objective that generates a temporally smooth embedding, enabling the value function to be implicitly defined via the embedding distance, which can then be used to construct the reward for any goal-image specified downstream task. Trained on large-scale Ego4D human videos and without any fine-tuning on in-domain, task-specific data, VIP's frozen representation can provide dense visual reward for an extensive set of simulated and $\textbf{real-robot}$ tasks, enabling diverse reward-based visual control methods and significantly outperforming all prior pre-trained representations. Notably, VIP can enable simple, $\textbf{few-shot}$ offline RL on a suite of real-world robot tasks with as few as 20 trajectories.
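
The abstract states that the value function is defined implicitly by embedding distance and that this yields a reward for any goal-image task. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the value is the negative L2 distance to the goal embedding and the per-step reward is the change in that value between consecutive frames; neither formula is given in the abstract. The function names `goal_value` and `embedding_rewards` are hypothetical, and the toy 2-D vectors stand in for the outputs of a frozen visual encoder.

```python
import numpy as np


def goal_value(embeddings, goal_embedding):
    """Assumed value: negative L2 distance to the goal in embedding space."""
    return -np.linalg.norm(embeddings - goal_embedding, axis=-1)


def embedding_rewards(embeddings, goal_embedding):
    """Assumed dense reward: change in value between consecutive frames."""
    values = goal_value(embeddings, goal_embedding)
    return values[1:] - values[:-1]


# Toy trajectory: stand-in embeddings that move in a straight line to the goal.
start = np.array([0.0, 0.0])
goal = np.array([3.0, 4.0])
trajectory = np.linspace(start, goal, num=6)  # shape (6, 2)

rewards = embedding_rewards(trajectory, goal)  # one reward per transition
print(rewards)
```

Under these assumptions, a trajectory that moves steadily towards the goal embedding receives a positive reward at every step, and the rewards sum to the total reduction in distance to the goal. This is the sense in which a temporally smooth embedding can supply a dense reward signal.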
