This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Unsupervised cross-domain Reinforcement Learning (RL) pre-training shows great potential for challenging continuous visual control but remains a significant challenge. In this paper, we propose \textbf{C}ross-domain \textbf{R}andom \textbf{P}re-\textbf{T}raining with \textbf{pro}totypes (CRPTpro), a novel, efficient, and effective self-supervised cross-domain RL pre-training framework. CRPTpro decouples data sampling from encoder pre-training through decoupled random collection, which quickly and easily generates a qualified cross-domain pre-training dataset. Moreover, we propose a novel prototypical self-supervised algorithm to pre-train an effective visual encoder that is generic across different domains. Without finetuning, the cross-domain encoder can be applied to challenging downstream tasks defined in different domains, either seen or unseen. Compared with recent advanced methods, CRPTpro achieves better performance on downstream policy learning without extra training of exploration agents for data collection, greatly reducing the burden of pre-training. We conduct extensive experiments across eight challenging continuous visual-control domains, including balance control, robot locomotion, and manipulation. CRPTpro significantly outperforms the next-best method, Proto-RL(C), on 11/12 cross-domain downstream tasks with only 54\% of its wall-clock pre-training time, exhibiting state-of-the-art pre-training performance with greatly improved pre-training efficiency. The complete code is available at https://github.com/liuxin0824/CRPTpro.
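To make the idea of decoupled random collection concrete, the following is a minimal sketch rather than the released implementation. It assumes that "random collection" means stepping each domain with uniformly random actions and pooling the resulting observations, with no exploration agent trained along the way. The `ToyDomain` class, the `collect_random` function, the domain names, and the action range of [-1, 1] are all hypothetical stand-ins introduced for illustration; a real setup would return image observations from actual visual-control environments.

```python
import random


class ToyDomain:
    """Hypothetical stand-in for a visual-control environment (not from the paper)."""

    def __init__(self, name, action_dim, episode_len=5):
        self.name = name
        self.action_dim = action_dim
        self.episode_len = episode_len
        self._t = 0

    def reset(self):
        self._t = 0
        return self._observe()

    def step(self, action):
        assert len(action) == self.action_dim
        self._t += 1
        done = self._t >= self.episode_len
        return self._observe(), done

    def _observe(self):
        # A real domain would return an image; a (domain, timestep) tag is used here.
        return (self.name, self._t)


def collect_random(domains, frames_per_domain, seed=0):
    """Gather observations using uniformly random actions (assumed collection rule).

    No policy or exploration agent is trained: data sampling is fully
    separated from whatever encoder pre-training happens afterwards.
    """
    rng = random.Random(seed)
    dataset = []
    for env in domains:
        obs = env.reset()
        for _ in range(frames_per_domain):
            dataset.append(obs)
            action = [rng.uniform(-1.0, 1.0) for _ in range(env.action_dim)]
            obs, done = env.step(action)
            if done:
                obs = env.reset()
    rng.shuffle(dataset)  # mix domains so later batches are cross-domain
    return dataset


# Hypothetical domain names and action dimensions, chosen only for the example.
domains = [ToyDomain("cartpole", 1), ToyDomain("walker", 6), ToyDomain("reacher", 2)]
dataset = collect_random(domains, frames_per_domain=20)
```

Because collection involves no learning, the pooled dataset can be generated once and reused for any number of encoder pre-training runs, which is consistent with the reduced pre-training burden described above.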