We study grokking, the onset of generalization long after overfitting, in a classical ridge regression setting. We prove end-to-end grokking results for over-parameterized linear regression models trained by gradient descent with weight decay. Specifically, we prove that training passes through three stages: (i) the model overfits the training data early in training; (ii) poor generalization persists long after overfitting has set in; and (iii) the generalization error eventually becomes arbitrarily small. Moreover, we show, both theoretically and empirically, that grokking can be amplified or eliminated in a principled manner through appropriate hyperparameter tuning. To the best of our knowledge, these are the first rigorous quantitative bounds on the generalization delay (which we refer to as the "grokking time") in terms of the training hyperparameters. Lastly, going beyond the linear setting, we empirically demonstrate that our quantitative bounds also capture the behavior of grokking in non-linear neural networks. Our results suggest that grokking is not an inherent failure mode of deep learning but rather a consequence of specific training conditions, and thus avoiding it requires no fundamental changes to the model architecture or learning algorithm.
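As a rough illustration of the setting described above, the sketch below trains an over-parameterized linear model by gradient descent with weight decay and prints the train and test errors over time. It is a minimal toy, not the paper's construction: the teacher `w_star` is placed in the row space of the training data so that the minimum-norm interpolator generalizes, and the initialization scale `alpha`, weight decay `lam`, and step size `lr` are arbitrary illustrative choices.

```python
import numpy as np

# Toy grokking demo in over-parameterized ridge regression (illustrative
# assumptions only, not the paper's exact setup).
rng = np.random.default_rng(0)
n, d = 20, 200                                # n samples, d >> n parameters

X = rng.normal(size=(n, d)) / np.sqrt(d)      # training inputs
w_star = X.T @ rng.normal(size=n)             # teacher in the row space of X
w_star /= np.linalg.norm(w_star)              # so the min-norm fit generalizes
y = X @ w_star                                # noiseless labels

X_test = rng.normal(size=(2000, d)) / np.sqrt(d)
y_test = X_test @ w_star

alpha, lam, lr = 10.0, 1e-3, 0.5              # init scale, weight decay, step size
w = alpha * rng.normal(size=d) / np.sqrt(d)   # large init delays generalization

for t in range(30_001):
    if t in (0, 100, 500, 1_000, 5_000, 10_000, 30_000):
        train = np.mean((X @ w - y) ** 2)
        test = np.mean((X_test @ w - y_test) ** 2)
        print(f"step {t:>6}: train MSE {train:.2e}   test MSE {test:.2e}")
    # gradient of (1/2n)||Xw - y||^2 + (lam/2)||w||^2
    w -= lr * (X.T @ (X @ w - y) / n + lam * w)
```

In this toy, the component of `w` orthogonal to the row space of `X` never affects the training loss, so the training error falls quickly, while the test error waits on that component's slow exponential decay at rate `lr * lam`; shrinking `lam` or enlarging `alpha` lengthens the plateau, which is the kind of hyperparameter dependence of the grokking time referred to above.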