We propose Value Explicit Pretraining (VEP), a method that learns generalizable representations for transfer reinforcement learning. VEP enables learning of new tasks whose objectives are similar to those of previously learned tasks by training an encoder that produces objective-conditioned representations that are invariant to changes in appearance and environment dynamics. To pre-train the encoder from sequences of observations, we use a self-supervised contrastive loss that yields temporally smooth representations. VEP learns to relate states across different tasks based on the Bellman return estimate, which reflects task progress. Experiments on a realistic navigation simulator and the Atari benchmark show that the encoder pretrained with our method generalizes to unseen tasks better than current state-of-the-art pretraining methods. VEP achieves up to a 2x improvement in reward on Atari and visual navigation, and up to a 3x improvement in sample efficiency. Videos of policy performance are available on our project page: https://sites.google.com/view/value-explicit-pretraining/
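To make the idea of relating states through their return estimates concrete, the following is a minimal sketch and not the paper's actual loss. It assumes a Monte Carlo discounted return as a stand-in for the Bellman return estimate, pairs each anchor state with the state from another task whose return is closest, and scores the pairing with a standard InfoNCE contrastive objective. The function names (`discounted_returns`, `return_matched_positives`, `info_nce`), the nearest-return matching rule, and the temperature value are all illustrative assumptions.

```python
import numpy as np


def discounted_returns(rewards, gamma=0.99):
    """Monte Carlo discounted return G_t = r_t + gamma * G_{t+1} for every step."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def return_matched_positives(anchor_returns, candidate_returns):
    """For each anchor state, pick the candidate state with the closest return.

    Assumption: states with similar returns are at similar task progress,
    even if they come from tasks that look different.
    """
    diff = np.abs(anchor_returns[:, None] - candidate_returns[None, :])
    return diff.argmin(axis=1)


def info_nce(anchor_emb, candidate_emb, positive_idx, temperature=0.1):
    """InfoNCE loss: each anchor should be closest to its positive candidate."""
    a = anchor_emb / np.linalg.norm(anchor_emb, axis=1, keepdims=True)
    c = candidate_emb / np.linalg.norm(candidate_emb, axis=1, keepdims=True)
    logits = a @ c.T / temperature
    # Subtract the row maximum for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(a)), positive_idx].mean()


if __name__ == "__main__":
    # Two hypothetical tasks with a sparse goal reward at the final step.
    returns_a = discounted_returns([0.0, 0.0, 1.0], gamma=0.5)
    returns_b = discounted_returns([0.0, 0.0, 1.0], gamma=0.5)
    positives = return_matched_positives(returns_a, returns_b)

    # Random vectors stand in for encoder outputs on the two tasks.
    rng = np.random.default_rng(0)
    emb_a = rng.normal(size=(3, 8))
    emb_b = rng.normal(size=(3, 8))
    print("positives:", positives)
    print("loss:", info_nce(emb_a, emb_b, positives))
```

With a sparse reward at the goal, the discounted return grows as the agent approaches the goal, which is why it can serve as a task-progress signal that is shared across tasks. In an actual pretraining loop the embeddings would come from the trainable encoder and the loss would be minimized with a gradient-based optimizer; the NumPy version here only evaluates the objective.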