Neural scaling laws describe how the performance of deep neural networks scales with key factors such as training data size, model complexity, and training time, often following power-law behavior over multiple orders of magnitude. Although these laws are well documented empirically, their theoretical underpinnings remain limited. In this work, we employ techniques from statistical mechanics to analyze one-pass stochastic gradient descent within a student-teacher framework, where both the student and the teacher are two-layer neural networks. Our study focuses primarily on the generalization error and how it responds to data covariance matrices with power-law spectra. For linear activation functions, we derive analytical expressions for the generalization error, explore different learning regimes, and identify the conditions under which power-law scaling emerges. We then extend the analysis to non-linear activation functions in the feature-learning regime, investigating how power-law spectra in the data covariance matrix shape the learning dynamics. Notably, we find that the length of the symmetric plateau depends on the number of distinct eigenvalues of the data covariance matrix and on the number of hidden units, and we characterize how these plateaus behave across configurations. Our results further reveal a transition from exponential to power-law convergence in the specialized phase when the data covariance matrix has a power-law spectrum. This work contributes to the theoretical understanding of neural scaling laws and offers insights into optimizing learning performance in practical scenarios involving complex data structures.
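The setup described above can be sketched in a minimal simulation: one-pass SGD on a two-layer student trained to match a fixed two-layer teacher, with inputs drawn from a Gaussian whose covariance has a power-law spectrum. All dimensions, the spectral exponent, the learning rate, and the choice of unit second-layer weights below are illustrative assumptions for the linear-activation regime, not values taken from the paper.

```python
# Minimal sketch (assumed parameters): one-pass SGD in a student-teacher
# setting with linear activations and a power-law data covariance spectrum.
import numpy as np

rng = np.random.default_rng(0)

d, k = 100, 2          # input dimension and hidden units (assumed)
alpha = 1.5            # power-law exponent of the covariance spectrum (assumed)
eta = 0.05             # learning rate (assumed)
steps = 20000          # one-pass: each sample is seen exactly once

# Data covariance with power-law eigenvalues lambda_i = i^(-alpha)
eigvals = np.arange(1, d + 1, dtype=float) ** (-alpha)
sqrt_cov = np.sqrt(eigvals)            # diagonal covariance: x = sqrt_cov * z

# Fixed teacher and small-initialized student first-layer weights;
# second-layer weights are fixed to 1, so the linear net is a linear map.
W_teacher = rng.standard_normal((k, d)) / np.sqrt(d)
W_student = 0.01 * rng.standard_normal((k, d))

def output(W, x):
    # Linear activation, unit second layer: f(x) = sum_j w_j . x
    return (W @ x).sum()

def gen_error(W_s, W_t):
    # For linear nets with diagonal covariance, the generalization error
    # E[(f_s(x) - f_t(x))^2] / 2 reduces to a weighted sum over modes.
    diff = W_s.sum(axis=0) - W_t.sum(axis=0)
    return 0.5 * np.sum(eigvals * diff ** 2)

errors = []
for t in range(steps):
    z = rng.standard_normal(d)
    x = sqrt_cov * z                       # fresh sample, power-law covariance
    err = output(W_student, x) - output(W_teacher, x)
    # SGD on the squared loss; each student row receives the same gradient.
    W_student -= eta * err * x / k
    if t % 1000 == 0:
        errors.append(gen_error(W_student, W_teacher))
```

Tracking `errors` over training exposes the decay of the generalization error; sweeping `alpha` is one way to probe how the spectrum's power-law exponent shapes the convergence behavior discussed above.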