Neural scaling laws are observed in a range of domains, to date with no clear understanding of why they occur. Recent theories suggest that loss power laws arise from Zipf's law, a power law observed in domains like natural language. One theory suggests that language scaling laws emerge when Zipf-distributed task quanta are learned in descending order of frequency. In this paper we examine power-law scaling in AlphaZero, a reinforcement learning algorithm, using a theory of language-model scaling. We find that game states in training and inference data follow Zipf's law, which is known to arise from the tree structure of the environment, and examine the correlation between scaling-law and Zipf's-law exponents. In agreement with quanta scaling theory, we find that agents optimize state loss in descending order of frequency, even though this order scales inversely with modelling complexity. We also find that inverse scaling, the failure of models to improve with size, is correlated with unusual Zipf curves where end-game states are among the most frequent states. We present evidence that larger models shift their focus to these less-important states, sacrificing their understanding of important early-game states.
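As context for the quanta scaling theory invoked above, a minimal sketch of its core argument may help; the symbols $p_k$, $n$, $N$, and $\alpha$ are illustrative notation, and the assumptions that a network of $N$ parameters learns the $n(N) \propto N$ most frequent quanta, with each unlearned quantum contributing loss in proportion to its frequency of use, come from the quantization model this theory builds on, not from results of this paper. If task quanta are used with Zipf-distributed frequencies $p_k \propto k^{-(\alpha+1)}$ and are learned in descending order of frequency, the residual loss is the tail mass of the Zipf distribution,
\[
L(N) \;\propto\; \sum_{k > n(N)} p_k \;\approx\; \int_{n(N)}^{\infty} x^{-(\alpha+1)}\,dx \;=\; \frac{n(N)^{-\alpha}}{\alpha} \;\propto\; N^{-\alpha},
\]
a power law in model size whose exponent is fixed by the Zipf exponent. Under this sketch, the correlation between scaling-law and Zipf's-law exponents, measured here on AlphaZero's game-state distribution, is the natural test of the theory.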