We introduce Autoverse, an evolvable, domain-specific language for single-player 2D grid-based games, and demonstrate its use as a scalable training ground for Open-Ended Learning (OEL) algorithms. Autoverse uses cellular-automaton-like rewrite rules to describe game mechanics, allowing it to express a variety of game environments (e.g., mazes, dungeons, Sokoban puzzles) that are popular testbeds for Reinforcement Learning (RL) agents. Each rewrite rule can be expressed as a series of simple convolutions, allowing environments to be parallelized on the GPU and thereby drastically accelerating RL training. Using Autoverse, we propose jump-starting open-ended learning via imitation learning from search. In this approach, we first evolve Autoverse environments (their rules and initial map topology) to maximize the number of iterations required by greedy tree search to discover a new best solution, producing a curriculum of increasingly complex environments and playtraces. We then distill these expert playtraces into a neural-network-based policy using imitation learning. Finally, we use the learned policy as a starting point for open-ended RL, in which new training environments are continually evolved to maximize the RL player agent's value function error (a proxy for its regret, or the learnability of generated environments). We find that this approach improves the performance and generality of the resultant player agents.
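To make the convolution framing concrete, the following is a minimal sketch of how a cellular-automaton-style rewrite rule can be matched and applied with 2D convolutions over one-hot tile channels. The rule format, channel names, and helper functions here are illustrative assumptions, not Autoverse's actual encoding; the example shows a "move right" rule rewriting the pattern [player, empty] to [empty, player].

```python
import numpy as np
from scipy.signal import convolve2d

# Illustrative sketch (not Autoverse's real rule format): the grid is a
# stack of one-hot channels, one per tile type, and a rewrite rule is an
# (input pattern, output pattern) pair of small binary tensors.
EMPTY, PLAYER = 0, 1

def match_pattern(grid, pat_in):
    """Binary map of top-left corners where pat_in occurs in grid.

    grid:   (C, H, W) one-hot tile channels
    pat_in: (C, h, w) binary pattern; a location matches when every 1
            in the pattern lines up with a 1 in the grid.
    The per-channel match scores are computed as convolutions (the kernel
    is flipped so convolve2d performs cross-correlation)."""
    score = sum(
        convolve2d(grid[c], pat_in[c, ::-1, ::-1], mode="valid")
        for c in range(grid.shape[0])
    )
    return score == pat_in.sum()

def apply_rule(grid, pat_in, pat_out):
    """Rewrite every match of pat_in to pat_out.

    Matches are taken against the original grid; overlapping matches are
    not resolved here, which is fine for this single-match example."""
    h, w = pat_in.shape[1:]
    matches = match_pattern(grid, pat_in)
    out = grid.copy()
    for i, j in zip(*np.nonzero(matches)):
        out[:, i:i + h, j:j + w] += pat_out - pat_in
    return out

# A "move right" rule: [player, empty] -> [empty, player].
pat_in = np.zeros((2, 1, 2), dtype=int)
pat_out = np.zeros((2, 1, 2), dtype=int)
pat_in[PLAYER, 0, 0] = 1; pat_in[EMPTY, 0, 1] = 1
pat_out[EMPTY, 0, 0] = 1; pat_out[PLAYER, 0, 1] = 1

# 3x3 map, player in the center, empty everywhere else.
grid = np.zeros((2, 3, 3), dtype=int)
grid[EMPTY] = 1
grid[EMPTY, 1, 1] = 0
grid[PLAYER, 1, 1] = 1

stepped = apply_rule(grid, pat_in, pat_out)
print(stepped[PLAYER, 1, 2])  # player has moved one cell to the right
```

Because the match step is a batch of small convolutions, many environments (or many candidate rules) can be stacked along a batch dimension and evaluated in parallel on a GPU, which is the property the abstract relies on for accelerating RL training.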