Scaling laws in deep learning, empirical power-law relationships linking model performance to resource growth, have emerged as simple yet striking regularities across architectures, datasets, and tasks. These laws are particularly impactful in guiding the design of state-of-the-art models, since they quantify the benefits of increasing data or model size, and they hint at the foundations of interpretability in machine learning. However, most studies focus on asymptotic behavior at the end of training. In this work, we describe a richer picture by analyzing the entire training dynamics: we identify two novel \textit{dynamical} scaling laws that govern how performance evolves as a function of different norm-based complexity measures. Combined, our new laws recover the well-known scaling for test error at convergence. Our findings are consistent across CNNs, ResNets, and Vision Transformers trained on MNIST, CIFAR-10, and CIFAR-100. Furthermore, we provide analytical support using a single-layer perceptron trained with logistic loss, where we derive the new dynamical scaling laws, and we explain them through the implicit bias induced by gradient-based training.
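To make the perceptron setting concrete, the following is a minimal sketch of how one could track a norm-based quantity alongside test error over the course of training. It is an illustration under stated assumptions, not the setup used in this work: the teacher-student Gaussian data, the use of the plain weight norm as the complexity measure, the hyperparameters, and the function name `train_perceptron` are all hypothetical choices made for this example.

```python
import numpy as np


def train_perceptron(n_train=200, n_test=2000, dim=50, steps=2000, lr=0.5, seed=0):
    """Train a single-layer perceptron with logistic loss by gradient descent.

    Returns the weight norm and the test error recorded after every step.
    All settings here are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)

    # Assumed data model: labels come from a random unit-norm "teacher" vector,
    # so the training set is linearly separable.
    teacher = rng.standard_normal(dim)
    teacher /= np.linalg.norm(teacher)
    X_train = rng.standard_normal((n_train, dim))
    X_test = rng.standard_normal((n_test, dim))
    y_train = np.sign(X_train @ teacher)
    y_test = np.sign(X_test @ teacher)

    w = np.zeros(dim)
    norms, test_errors = [], []
    for _ in range(steps):
        margins = y_train * (X_train @ w)
        # sigmoid(-margin), written with tanh for numerical stability
        sig = 0.5 * (1.0 - np.tanh(0.5 * margins))
        # Gradient of the mean logistic loss log(1 + exp(-margin)) w.r.t. w
        grad = -(X_train * (y_train * sig)[:, None]).mean(axis=0)
        w -= lr * grad

        # Record the complexity proxy (weight norm) and the 0-1 test error
        norms.append(np.linalg.norm(w))
        test_errors.append(np.mean(np.sign(X_test @ w) != y_test))

    return np.array(norms), np.array(test_errors)


if __name__ == "__main__":
    norms, test_errors = train_perceptron()
    # Print a few (norm, test error) pairs along the training trajectory
    for step in (0, 9, 99, 999, 1999):
        print(step + 1, norms[step], test_errors[step])
```

Plotting `test_errors` against `norms` on log-log axes, for several values of `n_train`, is one way to look for power-law behavior along the trajectory. On separable data the weight norm keeps growing under gradient descent on the logistic loss, which is the implicit-bias effect the abstract refers to; whether this toy setup reproduces the specific laws reported in the paper is not something the sketch establishes.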