Unifying Learning Dynamics and Generalization in Transformers Scaling Law

The scaling law, a cornerstone of Large Language Model (LLM) development, predicts improvements in model performance with increasing computational resources. Yet, while empirically validated, its theoretical underpinnings remain poorly understood. This work formalizes the learning dynamics of transformer-based language models as an ordinary differential equation (ODE) system, then approximates this process by kernel behavior. Departing from prior toy-model analyses, we rigorously analyze stochastic gradient descent (SGD) training for multi-layer transformers on sequence-to-sequence data under an arbitrary data distribution, closely mirroring real-world conditions. Our analysis characterizes how the generalization error converges to the irreducible risk as computational resources scale with data, with particular attention to the optimization process. We establish matching upper and lower bounds on the excess risk, characterized by a distinct phase transition. In the initial optimization phase, the excess risk decays exponentially in the computational cost $\mathsf{C}$. However, once a specific resource allocation threshold is crossed, the system enters a statistical phase, where the generalization error follows a power-law decay of $\Theta(\mathsf{C}^{-1/7})$. These rates are certified by complementary lower bounds -- statistical, via an information-theoretic two-point reduction, and optimization-side, via a first-order oracle argument -- rendering the two-stage law tight up to constants, logarithmic factors, and a condition-number gap. Beyond this unified framework, our theory derives isolated scaling laws for model size, training time, and dataset size, elucidating how each variable independently governs the generalization bounds.
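
To make the shape of the two-stage law concrete, the following is a minimal illustrative sketch, not the paper's derivation. It assumes a piecewise curve: exponential decay below a threshold, then a power law with exponent $-1/7$ above it. The threshold `c_star`, prefactor `a`, and decay rate `rate` are hypothetical constants chosen only for illustration; the abstract does not specify them. The power-law prefactor is set so the curve is continuous at the threshold, which is also an assumption.

```python
import math


def excess_risk(c, c_star=100.0, a=1.0, rate=0.05, exponent=1.0 / 7.0):
    """Schematic two-stage excess-risk curve (all constants are hypothetical)."""
    if c <= 0:
        raise ValueError("compute budget must be positive")
    if c < c_star:
        # Optimization phase: excess risk decays exponentially in compute.
        return a * math.exp(-rate * c)
    # Statistical phase: power-law decay C^(-1/7), with the prefactor
    # chosen (by assumption) so the curve is continuous at c_star.
    b = a * math.exp(-rate * c_star) * c_star ** exponent
    return b * c ** (-exponent)


def loglog_slope(c1, c2):
    """Slope of log(excess risk) against log(compute) between two budgets."""
    rise = math.log(excess_risk(c2)) - math.log(excess_risk(c1))
    return rise / (math.log(c2) - math.log(c1))


if __name__ == "__main__":
    for c in (10.0, 50.0, 100.0, 1e3, 1e6):
        print(f"C = {c:>9.0f}   excess risk = {excess_risk(c):.3e}")
    print("log-log slope in the statistical phase:", loglog_slope(1e3, 1e6))
```

On a log-log plot, the statistical phase of this sketch is a straight line of slope $-1/7$, which is how a power law of this form would typically be read off empirical compute-versus-loss curves.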
