In unsupervised environment design, reinforcement learning agents are trained on environment configurations (levels) generated by an adversary that maximises some objective. Regret is a commonly used objective that theoretically results in a minimax regret (MMR) policy with desirable robustness guarantees; in particular, the agent's maximum regret is bounded. However, once the agent reaches this regret bound on all levels, the adversary will only sample levels where regret cannot be further reduced. Although there are possible performance improvements to be made outside of these regret-maximising levels, learning stagnates. In this work, we introduce Bayesian level-perfect MMR (BLP), a refinement of the minimax regret objective that overcomes this limitation. We formally show that solving for this objective results in a subset of MMR policies, and that BLP policies act consistently with a Perfect Bayesian policy over all levels. We further introduce an algorithm, ReMiDi, that results in a BLP policy at convergence. We empirically demonstrate that training on levels from a minimax regret adversary causes learning to prematurely stagnate, but that ReMiDi continues learning.
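For reference, the regret objective described above can be written out as a short sketch. The notation here is an assumption and is not taken from the abstract: $\theta \in \Theta$ denotes a level, $U_\theta(\pi)$ the expected return of policy $\pi$ on level $\theta$, and $\pi^\star_\theta$ an optimal policy for that level.

```latex
% Regret of policy \pi on level \theta (assumed notation):
\[
  \mathrm{Regret}_\theta(\pi) \;=\; U_\theta(\pi^\star_\theta) \;-\; U_\theta(\pi)
\]
% A minimax regret (MMR) policy minimises the worst-case regret over levels:
\[
  \pi_{\mathrm{MMR}} \;\in\; \operatorname*{arg\,min}_{\pi} \; \max_{\theta \in \Theta} \; \mathrm{Regret}_\theta(\pi)
\]
```

Under this reading, the adversary's choice of $\theta$ is the inner maximisation. Once the policy attains the minimax value, the maximising levels are exactly those where regret cannot be lowered further, so levels outside that set no longer receive training signal.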