Unsupervised reinforcement learning (URL) offers a promising paradigm for learning useful behaviors in a task-agnostic environment, without the guidance of extrinsic rewards, so that an agent can adapt quickly to various downstream tasks. Previous works focused on model-free pre-training and did not study the modeling of transition dynamics, which leaves considerable room for improving sample efficiency in downstream tasks. To this end, we propose an Efficient Unsupervised Reinforcement Learning Framework with Multi-choice Dynamics model (EUCLID), which introduces a novel model-fused paradigm that jointly pre-trains the dynamics model and the unsupervised exploration policy in the pre-training phase, thus making better use of the environmental samples and improving sample efficiency in downstream tasks. However, constructing a generalizable model that captures the local dynamics under different behaviors remains a challenging problem. We therefore introduce the multi-choice dynamics model, which covers the local dynamics under different behaviors concurrently: it uses different heads to learn the state transitions under different behaviors during unsupervised pre-training and selects the most appropriate head for prediction in the downstream task. Experimental results in the manipulation and locomotion domains demonstrate that EUCLID achieves state-of-the-art performance with high sample efficiency, essentially solving the state-based URLB benchmark and reaching a mean normalized score of 104.0$\pm$1.2$\%$ in downstream tasks with 100k fine-tuning steps, which matches the performance of DDPG trained for 2M interaction steps, i.e., with 20x more data.
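The multi-head idea described above can be illustrated with a small sketch. This is not the EUCLID implementation: the class name `MultiChoiceDynamics`, the shared `tanh` encoder, the linear heads, and in particular the head-selection rule (lowest mean squared one-step prediction error on a batch of downstream transitions) are all assumptions made for illustration, since the abstract does not specify how "the most appropriate head" is chosen.

```python
import numpy as np


class MultiChoiceDynamics:
    """Toy multi-head dynamics model (hypothetical, for illustration only).

    A shared encoder feeds one linear head per pre-training behavior.
    Each head predicts a state delta, so head k predicts s' = s + delta_k.
    """

    def __init__(self, state_dim, action_dim, hidden_dim, num_heads, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = state_dim + action_dim
        # Shared encoder weights: (state + action) -> hidden features.
        self.w_enc = rng.normal(0.0, 0.1, size=(in_dim, hidden_dim))
        # One weight matrix per head: hidden features -> state delta.
        self.w_heads = rng.normal(0.0, 0.1, size=(num_heads, hidden_dim, state_dim))

    def predict(self, states, actions):
        """Return predictions of shape (num_heads, batch, state_dim)."""
        x = np.concatenate([states, actions], axis=-1)
        h = np.tanh(x @ self.w_enc)  # (batch, hidden_dim)
        deltas = np.einsum("bh,khs->kbs", h, self.w_heads)
        return states[None] + deltas

    def head_errors(self, states, actions, next_states):
        """Mean squared one-step prediction error of every head."""
        preds = self.predict(states, actions)
        return ((preds - next_states[None]) ** 2).mean(axis=(1, 2))

    def select_head(self, states, actions, next_states):
        """Assumed selection rule: pick the head with the lowest error."""
        return int(np.argmin(self.head_errors(states, actions, next_states)))


# Usage: pretend the downstream dynamics coincide with head 2.
model = MultiChoiceDynamics(state_dim=4, action_dim=2, hidden_dim=16, num_heads=3)
rng = np.random.default_rng(1)
s = rng.normal(size=(32, 4))
a = rng.normal(size=(32, 2))
downstream_next = model.predict(s, a)[2]
best = model.select_head(s, a, downstream_next)
print("selected head:", best)
```

In this sketch the weights are random and no training is performed; the point is only the structure: all heads share one encoder, and a single head is chosen once downstream data is available, after which only that head would be used for prediction.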