Reinforcement Learning (RL) algorithms are known for sample inefficiency and poor generalization. Recently, Unsupervised Environment Design (UED) has emerged as a new paradigm for zero-shot generalization that simultaneously learns a task distribution and agent policies on the generated tasks. This process is non-stationary: the task distribution evolves along with the agent policies, which creates instability over time. While past works have demonstrated the potential of such approaches, sampling effectively from the task space remains an open challenge and a bottleneck for them. To this end, we introduce CLUTR, a novel unsupervised curriculum learning algorithm that decouples task representation and curriculum learning into a two-stage optimization. It first trains a recurrent variational autoencoder on randomly generated tasks to learn a latent task manifold. Next, a teacher agent creates a curriculum by maximizing a minimax regret-based objective on a set of latent tasks sampled from this manifold. Using the fixed, pretrained task manifold, we show that CLUTR successfully overcomes the non-stationarity problem and improves stability. Our experimental results show that CLUTR outperforms PAIRED, a principled and popular UED method, in the challenging CarRacing and navigation environments, achieving 10.6× and 45% improvements in zero-shot generalization, respectively. CLUTR also performs comparably to the non-UED state of the art for CarRacing while requiring 500× fewer environment interactions.
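The regret-based teacher objective mentioned above can be illustrated with a minimal sketch. The abstract does not specify how regret is estimated, so this example assumes a PAIRED-style estimate (best antagonist return minus mean protagonist return) and replaces the learned teacher agent with a greedy argmax over a handful of latent tasks. The names `Latent`, `estimated_regret`, `greedy_teacher`, and `toy_returns` are hypothetical, and the return values are made up for illustration.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple

# Hypothetical type: a point sampled from the pretrained latent task manifold.
Latent = Tuple[float, ...]


def estimated_regret(antagonist_returns: Sequence[float],
                     protagonist_returns: Sequence[float]) -> float:
    """Assumed PAIRED-style estimate: best antagonist return minus mean protagonist return."""
    return max(antagonist_returns) - mean(protagonist_returns)


def greedy_teacher(
    returns_by_task: Dict[Latent, Tuple[List[float], List[float]]]
) -> Latent:
    """Pick the latent task with the highest estimated regret.

    Assumption: a greedy argmax stands in for the learned teacher policy.
    """
    return max(returns_by_task, key=lambda z: estimated_regret(*returns_by_task[z]))


# Toy episode returns (illustrative only): (antagonist, protagonist) per latent task.
toy_returns = {
    (0.1, -0.3): ([0.9, 0.8], [0.7, 0.9]),  # both agents do well: low regret
    (1.2, 0.5): ([0.9, 0.6], [0.1, 0.3]),   # solvable, but hard for the protagonist
    (-2.0, 2.0): ([0.0, 0.0], [0.0, 0.0]),  # neither agent makes progress: zero regret
}

chosen = greedy_teacher(toy_returns)
print(chosen)
```

Under this estimate, tasks that are too easy or unsolvable both yield low regret, so the selection favors tasks that are solvable yet still difficult for the protagonist. Because the latent points come from a fixed manifold in this sketch, the task representation itself does not change while the selection runs.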