Cross-domain Random Pre-training with Prototypes for Reinforcement Learning

From arXiv. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Task-agnostic cross-domain pre-training shows great potential for image-based Reinforcement Learning (RL) but remains highly challenging. In this paper, we propose CRPTpro, a Cross-domain self-supervised Random Pre-Training framework with prototypes for image-based RL. CRPTpro employs a cross-domain random policy to sample diverse data from multiple domains quickly and cheaply, which improves pre-training efficiency. Moreover, we propose prototypical representation learning with a novel intrinsic loss to pre-train an effective, generic encoder across different domains. Without finetuning, the cross-domain encoder can be applied efficiently to challenging downstream visual-control RL tasks defined in different domains. Compared with prior methods such as APT and Proto-RL, CRPTpro achieves better performance on cross-domain downstream RL tasks without extra training of exploration agents for expert data collection, which greatly reduces the burden of pre-training. Experiments on the DeepMind Control suite (DMControl) demonstrate that CRPTpro significantly outperforms APT on 11 of 12 cross-domain RL tasks with only 39% of the pre-training hours, making it a state-of-the-art cross-domain pre-training method in both policy learning performance and pre-training efficiency. The complete code will be released at https://github.com/liuxin0824/CRPTpro.
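
As a rough illustration of the cross-domain random data collection described in the abstract, the following minimal sketch rolls out a uniform random policy in several domains and pools the resulting frames into one buffer. It is not the authors' implementation: `ToyDomain`, `collect_random_data`, the domain names, the 84x84 image size, and the [-1, 1] action range are all assumptions made for illustration, and the toy domains return random bytes rather than rendered DMControl frames.

```python
import random


class ToyDomain:
    """Hypothetical stand-in for an image-based control domain (not DMControl)."""

    def __init__(self, name, action_dim, rng, image_shape=(3, 84, 84)):
        self.name = name
        self.action_dim = action_dim
        self.image_shape = image_shape  # assumed channels x height x width
        self._rng = rng
        self._num_bytes = image_shape[0] * image_shape[1] * image_shape[2]

    def _render(self):
        # Placeholder observation: random pixel bytes instead of a real render.
        return bytes(self._rng.getrandbits(8) for _ in range(self._num_bytes))

    def reset(self):
        return self._render()

    def step(self, action):
        if len(action) != self.action_dim:
            raise ValueError("action has the wrong dimensionality")
        return self._render()


def collect_random_data(domains, steps_per_domain, rng):
    """Roll out a uniform random policy in every domain and pool the frames."""
    buffer = []
    for domain_id, env in enumerate(domains):
        obs = env.reset()
        for _ in range(steps_per_domain):
            buffer.append((domain_id, obs))
            # Random policy: no learned exploration agent is involved.
            action = [rng.uniform(-1.0, 1.0) for _ in range(env.action_dim)]
            obs = env.step(action)
    return buffer


rng = random.Random(0)
domains = [
    ToyDomain("domain_a", action_dim=6, rng=rng),
    ToyDomain("domain_b", action_dim=4, rng=rng),
    ToyDomain("domain_c", action_dim=2, rng=rng),
]
buffer = collect_random_data(domains, steps_per_domain=5, rng=rng)
print(len(buffer), [domain_id for domain_id, _ in buffer])
```

The point of the sketch is only that a random policy needs no training before data collection, and that one buffer can hold frames from domains with different action dimensionalities. How the pooled frames are then used for prototypical representation learning is not specified in the abstract and is not modeled here.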
