The recent observation of neural power-law scaling relations has had a significant impact on the field of deep learning. As a consequence, substantial attention has been devoted to describing scaling laws, although mostly for supervised learning and only to a lesser extent for reinforcement learning. In this paper we present an extensive study of performance scaling for a cornerstone reinforcement learning algorithm, AlphaZero. Building on a relationship between Elo rating, playing strength, and power-law scaling, we train AlphaZero agents on the games Connect Four and Pentago and analyze their performance. We find that playing strength scales as a power law in neural network parameter count when not bottlenecked by available compute, and as a power law in compute when training optimally sized agents. The scaling exponents are nearly identical for the two games. Combining the two observed scaling laws, we obtain a power law relating optimal network size to compute, similar to those observed for language models. The predicted scaling of optimal neural network size fits our data for both games. This scaling law implies that previously published state-of-the-art game-playing models are significantly smaller than their optimal size, given their respective compute budgets. We also show that large AlphaZero models are more sample efficient, performing better than smaller models given the same amount of training data.
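The link between Elo rating, playing strength, and power-law scaling can be illustrated with a minimal Python sketch. It relies on the standard Elo logistic model, in which a rating r corresponds to a Bradley-Terry strength of 10^(r/400); whether the paper uses exactly this parameterization is an assumption here. The helper names (`elo_to_strength`, `fit_power_law`, `optimal_size_exponent`) are hypothetical, the data are synthetic and not taken from the paper, and the exponent ratio is only one plausible way of combining the two laws, obtained by requiring both to hold along the compute-optimal frontier.

```python
import math


def elo_expected_score(r_a, r_b):
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def elo_to_strength(r):
    """Bradley-Terry strength gamma = 10^(r/400).

    With this definition, gamma_a / (gamma_a + gamma_b) equals
    elo_expected_score(r_a, r_b).
    """
    return 10.0 ** (r / 400.0)


def fit_power_law(xs, ys):
    """Least-squares fit of y = k * x**alpha in log-log space.

    Returns (alpha, k).
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    alpha = sxy / sxx
    k = math.exp(mean_y - alpha * mean_x)
    return alpha, k


def optimal_size_exponent(alpha_n, alpha_c):
    """Exponent of N_opt in compute C (assumed combination of the two laws).

    If strength ~ N**alpha_n when compute is not the bottleneck and
    strength ~ C**alpha_c for optimally sized agents, then requiring both
    on the compute-optimal frontier gives N_opt ~ C**(alpha_c / alpha_n).
    """
    return alpha_c / alpha_n


# Synthetic illustration only: ratings generated from gamma = 2 * N**0.5.
params = [1e3, 1e4, 1e5, 1e6]
elos = [400.0 * math.log10(2.0 * n ** 0.5) for n in params]
strengths = [elo_to_strength(r) for r in elos]
alpha_n, prefactor = fit_power_law(params, strengths)
print(f"fitted size exponent: {alpha_n:.3f}, prefactor: {prefactor:.3f}")
```

Fitting in log-log space keeps the example dependency-free. On real match data, the ratings would first have to be estimated from game outcomes, which this sketch does not attempt.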