Scaling laws have played a major role in the modern AI revolution, providing practitioners with predictive power over how model performance will improve with increasing data, compute, and number of model parameters. This has spurred intense interest in the origin of neural scaling laws, with a common suggestion being that they arise from power-law structure already present in the data. In this paper we study scaling laws for transformers trained to predict random walks (bigrams) on graphs with tunable complexity. We demonstrate that this simplified setting already gives rise to neural scaling laws, even in the absence of power-law structure in the data correlations. We further dial down the complexity of natural language systematically by training on sequences sampled from increasingly simplified generative language models, from 4-, 2-, and 1-layer transformer language models down to language bigrams, revealing a monotonic evolution of the scaling exponents. Our results also include scaling laws obtained from training on random walks on random graphs drawn from Erdős–Rényi and scale-free Barabási–Albert ensembles. Finally, we revisit conventional scaling laws for language modeling: we demonstrate that several essential results can be reproduced using 2-layer transformers with a context length of 50, provide a critical analysis of various fits used in the prior literature, demonstrate an alternative method for obtaining compute-optimal curves compared with current practice in the published literature, and provide preliminary evidence that maximal update parameterization may be more parameter-efficient than standard parameterization.
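To make the data-generation setup concrete, the following minimal sketch (not the paper's code; the graph sizes, edge probability, attachment parameter, and walk length are illustrative assumptions) samples random-walk sequences from Erdős–Rényi and Barabási–Albert graphs with networkx, with node indices serving directly as token ids for a transformer.

```python
# Illustrative sketch: random-walk (bigram) training sequences on random graphs.
import networkx as nx
import numpy as np

def sample_walks(graph, num_walks, walk_length, rng):
    """Sample uniform random walks over the graph; each walk is a token sequence."""
    nodes = list(graph.nodes())
    walks = []
    for _ in range(num_walks):
        node = rng.choice(nodes)
        walk = [node]
        for _ in range(walk_length - 1):
            neighbors = list(graph.neighbors(node))
            if not neighbors:                      # isolated node: restart the walk
                node = rng.choice(nodes)
            else:                                  # next token depends only on the
                node = neighbors[rng.integers(len(neighbors))]  # current one (bigram)
            walk.append(node)
        walks.append(walk)
    return np.array(walks)                         # shape: (num_walks, walk_length)

rng = np.random.default_rng(0)
n_nodes = 1024                                     # assumed graph size / vocabulary
er_graph = nx.erdos_renyi_graph(n_nodes, p=0.01, seed=0)    # Erdős–Rényi ensemble
ba_graph = nx.barabasi_albert_graph(n_nodes, m=4, seed=0)   # scale-free Barabási–Albert
er_data = sample_walks(er_graph, num_walks=10_000, walk_length=50, rng=rng)
ba_data = sample_walks(ba_graph, num_walks=10_000, walk_length=50, rng=rng)
```

The walk length of 50 is chosen here only to mirror the context length quoted in the abstract; tuning the edge probability (or the attachment parameter) is one way to vary graph complexity.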