We develop a solvable model of neural scaling laws beyond the kernel limit. Theoretical analysis of this model shows how performance scales with model size, training time, and the total amount of available data. We identify three scaling regimes corresponding to varying task difficulties: hard, easy, and super-easy tasks. For easy and super-easy target functions, which lie in the reproducing kernel Hilbert space (RKHS) defined by the initial infinite-width Neural Tangent Kernel (NTK), the scaling exponents remain unchanged between feature-learning and kernel-regime models. For hard tasks, defined as those outside the RKHS of the initial NTK, we demonstrate both analytically and empirically that feature learning can improve scaling with training time and compute, nearly doubling the exponent. This leads to a different compute-optimal strategy for scaling parameters and training time in the feature-learning regime. We support our finding that feature learning improves the scaling law for hard tasks, but not for easy and super-easy tasks, with experiments on nonlinear MLPs fitting functions with power-law Fourier spectra on the circle and on CNNs learning vision tasks.
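A minimal sketch, not the paper's exact experimental setup, of the kind of synthetic task the abstract mentions: a target function on the circle whose Fourier coefficients decay as a power law, fit by a small nonlinear MLP. The decay exponent `alpha`, the mode cutoff `K`, the network width, and the optimizer settings are all illustrative assumptions.

```python
# Sketch: power-law Fourier target on the circle, fit with a small MLP.
# alpha controls task difficulty: slower spectral decay makes the target "harder"
# relative to the RKHS of the initial NTK.
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)
alpha, K = 1.2, 256                                  # assumed spectral decay exponent and mode cutoff
coeffs = rng.standard_normal(K) * (np.arange(1, K + 1) ** (-alpha / 2))

def target(theta):
    # f(theta) = sum_k c_k cos(k * theta), with |c_k|^2 decaying like k^{-alpha} on average
    k = np.arange(1, K + 1)
    return np.cos(np.outer(theta, k)) @ coeffs

# Sample points on the circle, embed as (cos theta, sin theta).
theta = rng.uniform(0.0, 2 * np.pi, size=4096)
x = torch.tensor(np.stack([np.cos(theta), np.sin(theta)], axis=1), dtype=torch.float32)
y = torch.tensor(target(theta), dtype=torch.float32).unsqueeze(1)

mlp = nn.Sequential(nn.Linear(2, 512), nn.ReLU(),
                    nn.Linear(512, 512), nn.ReLU(),
                    nn.Linear(512, 1))
opt = torch.optim.SGD(mlp.parameters(), lr=1e-2)

for step in range(5000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(mlp(x), y)
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        # Tracking loss versus training step traces out the empirical scaling curve.
        print(step, loss.item())
```

Repeating such a run across widths and training horizons, and comparing richly trained networks against their linearized (kernel-regime) counterparts, is one way to probe whether the loss-versus-time exponent changes with feature learning.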