Neural scaling laws (NSL) refer to the phenomenon where model performance improves with scale. Sharma & Kaplan analyzed NSL using approximation theory and predicted that MSE losses decay as $N^{-\alpha}$ with $\alpha=4/d$, where $N$ is the number of model parameters and $d$ is the intrinsic input dimension. Although their theory works well in some cases (e.g., ReLU networks), we surprisingly find that a simple 1D problem, $y=x^2$, exhibits a scaling law ($\alpha=1$) that differs from their prediction ($\alpha=4$). Opening up the neural networks, we find that the new scaling law originates from lottery ticket ensembling: a wider network has, on average, more "lottery tickets", which are ensembled to reduce the variance of the outputs. We support the ensembling mechanism both by mechanistically interpreting single neural networks and by studying them statistically. We attribute the $N^{-1}$ scaling law to a "central limit theorem" of lottery tickets. Finally, we discuss potential implications for large language models and statistical physics-type theories of learning.
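
The central-limit argument can be illustrated with a toy simulation. This is a minimal sketch under strong assumptions, not the experiment behind the results above: each "lottery ticket" is modeled as a predictor of $y=x^2$ whose error is independent, zero-mean Gaussian noise, and the number of tickets is taken to grow in proportion to $N$. The names `ensemble_mse`, `fitted_exponent`, and `sigma` are hypothetical and chosen only for this illustration.

```python
import numpy as np


def ensemble_mse(n_tickets, n_trials=2000, sigma=0.1, seed=0):
    """MSE of the average of n_tickets i.i.d. noisy predictors of y = x^2."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n_trials)
    y = x ** 2
    # Assumption: every ticket predicts y plus independent zero-mean noise.
    noise = rng.normal(0.0, sigma, size=(n_trials, n_tickets))
    # Ensembling = averaging the tickets' predictions.
    pred = y + noise.mean(axis=1)
    return float(np.mean((pred - y) ** 2))


def fitted_exponent(ticket_counts, **kwargs):
    """Estimate alpha from a log-log linear fit of MSE against ticket count."""
    losses = [ensemble_mse(n, **kwargs) for n in ticket_counts]
    slope, _ = np.polyfit(np.log(ticket_counts), np.log(losses), 1)
    return -slope


if __name__ == "__main__":
    counts = [1, 2, 4, 8, 16, 32, 64, 128]
    print(f"fitted alpha ~ {fitted_exponent(counts):.2f}")
```

Under these assumptions the variance of the averaged prediction is $\sigma^2/n$ for $n$ tickets, so the fitted exponent should come out close to 1, in contrast to the $\alpha=4/d=4$ that the approximation-theory prediction gives for $d=1$.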