Pre-training for Reinforcement Learning (RL) with purely video data is a valuable yet challenging problem. Although in-the-wild videos are readily available and inherently contain a vast amount of prior world knowledge, the absence of action annotations and the common domain gap with downstream tasks hinder utilizing videos for RL pre-training. To address the challenge of pre-training with videos, we propose Pre-trained Visual Dynamics Representations (PVDR) to bridge the domain gap between videos and downstream tasks for efficient policy learning. By adopting video prediction as a pre-training task, we use a Transformer-based Conditional Variational Autoencoder (CVAE) to learn visual dynamics representations. The pre-trained visual dynamics representations capture the visual dynamics prior knowledge in the videos. This abstract prior knowledge can be readily adapted to downstream tasks and aligned with executable actions through online adaptation. We conduct experiments on a series of robotics visual control tasks and verify that PVDR is an effective approach to pre-training with videos for promoting policy learning.
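To make the CVAE pre-training objective concrete, the sketch below computes a standard conditional-VAE loss (reconstruction plus KL divergence to a standard-normal prior) on toy frame embeddings. This is a minimal illustration under assumed shapes: plain linear maps stand in for the paper's Transformer encoder and decoder, and all variable names (`context`, `target`, `enc_w`, `dec_w`) are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_diag_gaussian(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), computed per sample.
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def cvae_elbo_loss(context, target, enc_w, dec_w, latent_dim=8):
    """Toy CVAE step: encode (context, target) into a diagonal-Gaussian
    posterior, sample a latent via the reparameterization trick, then
    decode the future conditioned on the context."""
    h = np.concatenate([context, target], axis=-1) @ enc_w  # posterior params
    mu, log_var = h[..., :latent_dim], h[..., latent_dim:]
    z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    recon = np.concatenate([context, z], axis=-1) @ dec_w   # conditional decode
    recon_loss = np.mean(np.sum((recon - target) ** 2, axis=-1))
    kl_loss = np.mean(kl_diag_gaussian(mu, log_var))
    return recon_loss + kl_loss

# Hypothetical dimensions: 16-d embeddings of past/future frames, 8-d latent.
ctx_dim, tgt_dim, latent_dim = 16, 16, 8
enc_w = rng.standard_normal((ctx_dim + tgt_dim, 2 * latent_dim)) * 0.1
dec_w = rng.standard_normal((ctx_dim + latent_dim, tgt_dim)) * 0.1
context = rng.standard_normal((4, ctx_dim))  # embeddings of past frames
target = rng.standard_normal((4, tgt_dim))   # embeddings of future frames
loss = cvae_elbo_loss(context, target, enc_w, dec_w)
print(loss)
```

In the actual method, the latent produced by such a model serves as the visual dynamics representation; here the sketch only shows the shape of the training objective.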