We propose the $\textit{Quantization Model}$ of neural scaling laws, which explains both the observed power-law drop-off of loss with model and data size and the sudden emergence of new capabilities with scale. We derive this model from what we call the $\textit{Quantization Hypothesis}$, according to which learned network capabilities are quantized into discrete chunks ($\textit{quanta}$). We show that when quanta are learned in order of decreasing use frequency, a power law in use frequencies explains the observed power-law scaling of loss. We validate this prediction on toy datasets, then study how scaling curves decompose for large language models. Using language model internals, we auto-discover diverse model capabilities (quanta) and find tentative evidence that the distribution over the corresponding subproblems in the prediction of natural text is compatible with the power law that our theory predicts from the neural scaling exponent.
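The link between a power law in use frequencies and a power law in loss can be illustrated numerically. The following is a minimal sketch under assumptions that the abstract does not state: quantum $k$ is used with frequency $p_k \propto k^{-(\alpha+1)}$, each quantum that has not yet been learned contributes loss in proportion to its frequency, and a model that has learned the first $n$ quanta is left with the tail mass $\sum_{k>n} p_k$. The function names, the truncation at `n_quanta`, and the fitting range are hypothetical choices made for illustration only.

```python
import numpy as np

def loss_after_learning(n_learned, alpha, n_quanta=1_000_000):
    """Toy loss after learning the n_learned most frequent quanta.

    Assumption: use frequencies follow p_k ∝ k^-(alpha + 1), and the
    remaining loss equals the total frequency of unlearned quanta.
    """
    k = np.arange(1, n_quanta + 1, dtype=float)
    p = k ** -(alpha + 1.0)
    p /= p.sum()  # normalise to a probability distribution
    # tail[i] = sum of p_k over k >= i + 1 (reverse cumulative sum)
    tail = np.cumsum(p[::-1])[::-1]
    # Loss with the first n quanta learned is the mass of quanta k > n,
    # which sits at index n of the tail array.
    return tail[np.asarray(n_learned)]

def fitted_exponent(alpha):
    """Fit the log-log slope of toy loss against the number of learned quanta."""
    n = np.logspace(1, 3, 20).astype(int)
    loss = loss_after_learning(n, alpha)
    slope, _ = np.polyfit(np.log(n), np.log(loss), 1)
    return -slope

if __name__ == "__main__":
    for alpha in (0.5, 1.0):
        print(f"assumed alpha = {alpha}, fitted exponent = {fitted_exponent(alpha):.3f}")
```

Under these assumptions the fitted exponent should land close to the assumed $\alpha$, that is, the toy loss falls off roughly as $n^{-\alpha}$ in the number of learned quanta. The sketch says nothing about how $n$ relates to parameter count or dataset size.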