Recent work by Power et al. (2022) highlighted a surprising "grokking" phenomenon in learning arithmetic tasks: a neural net first "memorizes" the training set, reaching perfect training accuracy but near-random test accuracy, and then, after sufficiently longer training, suddenly transitions to perfect test accuracy. This paper studies grokking in theoretical setups and shows that it can be induced by a dichotomy between early-phase and late-phase implicit biases. Specifically, when homogeneous neural nets are trained with large initialization and small weight decay, on both classification and regression tasks, we prove that training stays trapped for a long time at a solution corresponding to a kernel predictor, and then transitions very sharply to min-norm/max-margin predictors, leading to a dramatic change in test accuracy.
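The training setup named above can be written out as a rough sketch. The notation here is assumed for illustration and is not taken from the abstract: $f(\theta; x)$ denotes the network output, $L$ the degree of homogeneity, $\ell$ a generic per-example loss, $\alpha$ the initialization scale, $\lambda$ the weight-decay coefficient, and $\bar{\theta}_{\mathrm{init}}$ a fixed reference initialization. Gradient flow is likewise an assumption made to keep the sketch simple.

```latex
% Assumed notation (not from the abstract): f(\theta; x) is the network output,
% \alpha > 0 the initialization scale, \lambda > 0 the weight-decay coefficient.
\begin{align*}
  &\text{Homogeneity:} && f(c\,\theta; x) = c^{L} f(\theta; x) \quad \text{for all } c > 0, \\
  &\text{Regularized loss:} && \mathcal{L}_{\lambda}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(f(\theta; x_i), y_i\bigr) + \frac{\lambda}{2}\,\lVert\theta\rVert_2^2, \\
  &\text{Training (gradient flow, assumed):} && \frac{d\theta}{dt} = -\nabla \mathcal{L}_{\lambda}(\theta), \qquad \theta(0) = \alpha\,\bar{\theta}_{\mathrm{init}}.
\end{align*}
```

In this reading, "large initialization" corresponds to large $\alpha$ and "small weight decay" to small $\lambda$. The early phase is the one in which the trajectory stays near a kernel predictor, and the late phase is the one in which it moves to a min-norm (regression) or max-margin (classification) predictor.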