Neural scaling laws (NSL) refer to the phenomenon where model performance improves with scale. Sharma & Kaplan analyzed NSL using approximation theory and predicted that mean squared error (MSE) losses decay as $N^{-\alpha}$, $\alpha=4/d$, where $N$ is the number of model parameters and $d$ is the intrinsic input dimension. Although their theory works well in some cases (e.g., ReLU networks), we find, surprisingly, that a simple 1D problem $y=x^2$ exhibits a different scaling law ($\alpha=1$) from their prediction ($\alpha=4$). We opened up the neural networks and found that the new scaling law originates from lottery ticket ensembling: a wider network on average contains more "lottery tickets", which are ensembled to reduce the variance of outputs. We support the ensembling mechanism both by mechanistically interpreting single neural networks and by studying them statistically. We attribute the $N^{-1}$ scaling law to the "central limit theorem" of lottery tickets: averaging independent tickets reduces the output variance, and hence the MSE loss, in inverse proportion to the number of tickets, which grows with width. Finally, we discuss potential implications for large language models and statistical-physics-style theories of learning.
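To make the central-limit intuition concrete, here is a minimal numerical sketch (not the paper's experimental setup): it assumes a wide network behaves like the average of $k$ independent, unbiased "lottery ticket" predictors of $y=x^2$, each with error variance $\sigma^2$, and checks that the ensemble's MSE falls as $\sigma^2/k$, which reproduces the $N^{-1}$ law if $k$ scales linearly with $N$. The variable names `sigma2`, `xs`, and the chosen values of `k` are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumption: each "lottery ticket" is an unbiased predictor
# of y = x^2 whose error is i.i.d. noise with variance sigma2.
sigma2 = 0.1
xs = rng.uniform(-1.0, 1.0, size=10_000)
target = xs ** 2

for k in [1, 4, 16, 64, 256]:
    # Draw k independent ticket predictions and ensemble them by averaging.
    tickets = target[None, :] + rng.normal(0.0, np.sqrt(sigma2), size=(k, xs.size))
    pred = tickets.mean(axis=0)
    mse = np.mean((pred - target) ** 2)
    # MSE should track sigma2 / k, i.e. variance reduction by averaging.
    print(f"k={k:4d}  MSE={mse:.5f}  sigma^2/k={sigma2 / k:.5f}")
```

Under these assumptions the printed MSE matches $\sigma^2/k$ closely, so a model whose ticket count grows linearly with parameter count $N$ would show loss $\propto N^{-1}$, i.e. $\alpha=1$.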