Broken Neural Scaling Laws

We present a smoothly broken power law functional form, which we refer to as a Broken Neural Scaling Law (BNSL), that accurately models and extrapolates the scaling behaviors of deep neural networks (i.e., how the evaluation metric of interest varies as the amount of compute used for training, number of model parameters, training dataset size, model input size, number of training steps, or upstream performance varies) for various architectures and for each task in a large and diverse set of upstream and downstream tasks, in zero-shot, prompted, and fine-tuned settings. This set includes large-scale vision, language, audio, video, diffusion, generative modeling, multimodal learning, contrastive learning, AI alignment, robotics, out-of-distribution (OOD) generalization, continual learning, transfer learning, uncertainty estimation / calibration, out-of-distribution detection, adversarial robustness, distillation, sparsity, retrieval, quantization, pruning, molecules, computer programming/coding, math word problems, "emergent" "phase transitions / changes", arithmetic, unsupervised/self-supervised learning, and reinforcement learning (single agent and multi-agent). Compared to other functional forms for neural scaling behavior, this functional form yields considerably more accurate extrapolations of scaling behavior on this set. Moreover, it accurately models and extrapolates scaling behavior that other functional forms are incapable of expressing, such as the non-monotonic transitions present in the scaling behavior of phenomena such as double descent and the delayed, sharp inflection points present in the scaling behavior of tasks such as arithmetic. Lastly, we use this functional form to glean insights about the limit of the predictability of scaling behavior. Code is available at https://github.com/ethancaballero/broken_neural_scaling_laws
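To make "smoothly broken power law" concrete, the following is a minimal sketch of one common way to write such a function: a base power law multiplied by one smooth transition factor per break. The abstract does not state the formula, so the parameterization, the function names (`smoothly_broken_power_law`, `local_loglog_slope`), and the parameter names (`a`, `b`, `c0`, and per-break `c_i`, `d_i`, `f_i`) are assumptions made for illustration. The linked repository is the authoritative source for the authors' exact form and fitting procedure.

```python
import numpy as np


def smoothly_broken_power_law(x, a, b, c0, breaks=()):
    """Evaluate an illustrative smoothly broken power law.

    Assumed form (not taken from the abstract):
        y = a + b * x**(-c0) * prod_i (1 + (x / d_i)**(1 / f_i))**(-c_i * f_i)

    a      : offset that y approaches when the power-law term vanishes
    b, c0  : scale and log-log slope of the first segment
    breaks : iterable of (c_i, d_i, f_i) tuples, one per break, where
             c_i is the change in slope after the break,
             d_i is the x-location of the break,
             f_i is the sharpness (smaller f_i gives a sharper break)
    """
    x = np.asarray(x, dtype=float)
    y = b * x ** (-c0)
    for c_i, d_i, f_i in breaks:
        y = y * (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return a + y


def local_loglog_slope(x, eps=1e-4, **params):
    """Finite-difference estimate of d log(y) / d log(x) at x."""
    x_lo, x_hi = x * (1.0 - eps), x * (1.0 + eps)
    y_lo = smoothly_broken_power_law(x_lo, **params)
    y_hi = smoothly_broken_power_law(x_hi, **params)
    return (np.log(y_hi) - np.log(y_lo)) / (np.log(x_hi) - np.log(x_lo))


if __name__ == "__main__":
    # One break at x = 1e3: slope -c0 before it, -(c0 + c_1) after it.
    params = dict(a=0.0, b=1.0, c0=0.5, breaks=[(1.0, 1e3, 0.1)])
    for x in (1.0, 1e3, 1e6):
        print(x, local_loglog_slope(x, **params))
```

In this sketch, a plain power law is the special case with no breaks. Under the assumed parameterization, a negative `c_i` that outweighs the accumulated slope makes the curve change direction after the break, which is one way a form of this kind can express non-monotonic behavior such as double descent.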
