Benchmarks play a crucial role in the development and analysis of reinforcement learning (RL) algorithms. We observe that existing benchmarks for research into open-ended learning fall into one of two categories. Either they are too slow for meaningful research to be performed without enormous computational resources, like Crafter, NetHack and Minecraft, or they are not complex enough to pose a significant challenge, like Minigrid and Procgen. To remedy this, we first present Craftax-Classic: a ground-up rewrite of Crafter in JAX that runs up to 250x faster than the Python-native original. A PPO run of 1 billion environment interactions finishes in under an hour on a single GPU and averages 90% of the optimal reward. To provide a more compelling challenge, we present the main Craftax benchmark, a significant extension of the Crafter mechanics with elements inspired by NetHack. Solving Craftax requires deep exploration, long-term planning and memory, as well as continual adaptation to novel situations as more of the world is discovered. We show that existing methods, including global and episodic exploration as well as unsupervised environment design, fail to make material progress on the benchmark. We believe that Craftax allows researchers, for the first time, to experiment in a complex, open-ended environment with limited computational resources.
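To illustrate why writing an environment in JAX can yield large speedups, the following is a minimal sketch of the general pattern, not code from Craftax itself. It assumes a toy grid world whose names (`GRID`, `MOVES`, `step`, `rollout`, `batched_returns`) are all hypothetical: the transition is a pure function, an episode is unrolled with `jax.lax.scan`, and many environments are run in parallel with `jax.vmap` inside a single `jax.jit`-compiled program.

```python
import jax
import jax.numpy as jnp

GRID = 8  # side length of a toy square grid world (hypothetical, not Craftax)
MOVES = jnp.array([[0, 1], [0, -1], [1, 0], [-1, 0]])  # one offset per action


def step(pos, action):
    """Pure transition: apply the move, clip to the grid, reward at the far corner."""
    new_pos = jnp.clip(pos + MOVES[action], 0, GRID - 1)
    reward = jnp.all(new_pos == GRID - 1).astype(jnp.float32)
    return new_pos, reward


def rollout(key, num_steps):
    """One episode under a uniform random policy, unrolled with lax.scan."""
    actions = jax.random.randint(key, (num_steps,), 0, 4)
    start = jnp.zeros(2, dtype=jnp.int32)
    _, rewards = jax.lax.scan(step, start, actions)
    return rewards.sum()


@jax.jit
def batched_returns(keys):
    # vmap maps the rollout over all keys, so every environment runs
    # in parallel inside one compiled program on the accelerator.
    return jax.vmap(lambda k: rollout(k, 64))(keys)


keys = jax.random.split(jax.random.PRNGKey(0), 1024)
returns = batched_returns(keys)  # shape (1024,): one return per environment
```

Because the whole rollout is compiled and batched, no Python interpreter overhead is paid per environment step, which is the kind of effect a JAX rewrite is designed to exploit. The actual speedup depends on the environment and hardware.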