Unsupervised Environment Design (UED) seeks to automatically generate training curricula for reinforcement learning (RL) agents, with the goal of improving generalisation and zero-shot performance. However, designing effective curricula remains difficult, particularly in settings where small subsets of environment parameterisations sharply increase the complexity of the required policy. Current methods face a difficult credit assignment problem and rely on regret approximations that fail to identify challenging levels; both problems are compounded as the size of the environment grows. We propose Dynamic Environment Generation for UED (DEGen), which provides the level generator with a denser reward signal, easing credit assignment and allowing UED to scale to larger environments. We also introduce a new regret approximation, Maximised Negative Advantage (MNA), a significantly improved optimisation target that better identifies challenging levels. We show empirically that MNA outperforms current regret approximations and, when combined with DEGen, consistently outperforms existing methods, especially as the size of the environment grows. We have made all our code available here: https://github.com/HarryMJMead/Dynamic-Environment-Generation-for-UED.