The goal of Unsupervised Reinforcement Learning (URL) is to find a reward-agnostic prior policy on a task domain that improves sample efficiency on supervised downstream tasks. Although agents initialized with such a prior policy can achieve significantly higher reward with fewer samples when finetuned on a downstream task, how to obtain an optimal pretrained prior policy in practice remains an open question. In this work, we present POLTER (Policy Trajectory Ensemble Regularization), a general method for regularizing pretraining that can be applied to any URL algorithm and is especially useful for data- and knowledge-based URL algorithms. It uses an ensemble of policies discovered during pretraining to move the policy of the URL algorithm closer to its optimal prior. Our method is based on a theoretical framework, and we analyze its practical effects on a white-box benchmark, which allows us to study POLTER with full control. In our main experiments, we evaluate POLTER on the Unsupervised Reinforcement Learning Benchmark (URLB), which consists of 12 tasks in 3 domains. We demonstrate the generality of our approach by improving the performance of a diverse set of data- and knowledge-based URL algorithms by 19% on average and by up to 40% in the best case. In a fair comparison between tuned baselines and tuned POLTER, we establish a new state of the art for model-free methods on the URLB.
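To make the idea of regularizing toward an ensemble of pretraining policies concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state the form of the regularizer, so the sketch assumes a KL penalty between the current policy and a uniform mixture of policy snapshots, with discrete action distributions for a single state. The names `ensemble_mixture`, `kl_divergence`, `regularized_loss`, and the weight `alpha` are hypothetical.

```python
import numpy as np


def ensemble_mixture(snapshots):
    """Uniform mixture of snapshot action distributions (assumed ensemble form)."""
    return np.mean(np.stack(snapshots), axis=0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, clipped to avoid log(0)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))


def regularized_loss(url_loss, current_policy, snapshots, alpha=1.0):
    """Add an assumed KL penalty that pulls the policy toward the ensemble."""
    mixture = ensemble_mixture(snapshots)
    return url_loss + alpha * kl_divergence(current_policy, mixture)


# Two hypothetical snapshots of a policy over two actions in one state.
snapshots = [np.array([0.5, 0.5]), np.array([0.9, 0.1])]
current = np.array([0.1, 0.9])
loss = regularized_loss(url_loss=1.0, current_policy=current, snapshots=snapshots)
```

In this sketch the penalty vanishes when the current policy equals the ensemble mixture and grows as the policy drifts away from it; `alpha` trades off the URL objective against the regularizer.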