A key challenge in training generally-capable agents is the design of training tasks that facilitate broad generalization and robustness to environment variations. This challenge motivates the problem setting of Unsupervised Environment Design (UED), whereby a student agent trains on an adaptive distribution of tasks proposed by a teacher agent. A pioneering approach for UED is PAIRED, which uses reinforcement learning (RL) to train a teacher policy to design tasks from scratch, making it possible to directly generate tasks that are adapted to the agent's current capabilities. Despite its strong theoretical backing, PAIRED suffers from a variety of challenges that hinder its practical performance. Thus, state-of-the-art methods currently rely on curation and mutation rather than generation of new tasks. In this work, we investigate several key shortcomings of PAIRED and propose solutions for each shortcoming. As a result, we make it possible for PAIRED to match or exceed state-of-the-art methods, producing robust agents in several established challenging procedurally-generated environments, including a partially-observed maze navigation task and a continuous-control car racing environment. We believe this work motivates a renewed emphasis on UED methods based on learned models that directly generate challenging environments, potentially unlocking more open-ended RL training and, as a result, more general agents.