Deep reinforcement learning (RL) provides powerful methods for training optimal sequential decision-making agents. As collecting real-world interactions can entail additional costs and safety risks, the common paradigm of sim2real conducts training in a simulator, followed by real-world deployment. Unfortunately, RL agents easily overfit to the choice of simulated training environments, and worse still, learning ends when the agent masters the specific set of simulated environments. In contrast, the real world is highly open-ended, featuring endlessly evolving environments and challenges, making such RL approaches unsuitable. Simply randomizing over simulated environments is insufficient, as it requires making arbitrary distributional assumptions and can be combinatorially less likely to sample specific environment instances that are useful for learning. An ideal learning process should automatically adapt the training environment to maximize the learning potential of the agent over an open-ended task space that matches or surpasses the complexity of the real world. This thesis develops a class of methods called Unsupervised Environment Design (UED), which aim to produce such open-ended processes. Given an environment design space, UED automatically generates an infinite sequence or curriculum of training environments at the frontier of the learning agent's capabilities. Through extensive empirical studies and theoretical arguments founded on minimax-regret decision theory and game theory, the findings in this thesis show that UED autocurricula can produce RL agents exhibiting significantly improved robustness and generalization to previously unseen environment instances. Such autocurricula are promising paths toward open-ended learning systems that achieve more general intelligence by continually generating and mastering additional challenges of their own design.