StarCraft II is one of the most challenging simulated reinforcement learning environments: it is partially observable, stochastic, and multi-agent, and mastering it requires strategic planning over long time horizons combined with real-time low-level execution. It also has an active professional competitive scene. StarCraft II is uniquely suited for advancing offline RL algorithms, both because of its challenging nature and because Blizzard has released a massive dataset of millions of StarCraft II games played by human players. This paper leverages that dataset to establish a benchmark, called AlphaStar Unplugged, that introduces unprecedented challenges for offline reinforcement learning. We define a dataset (a subset of Blizzard's release), tools standardizing an API for machine learning methods, and an evaluation protocol. We also present baseline agents, including behavior cloning and offline variants of actor-critic and MuZero. We improve the state of the art of agents using only offline data, and we achieve a 90% win rate against the previously published AlphaStar behavior cloning agent.