One puzzling phenomenon in machine learning, dubbed grokking, is delayed generalization: a model generalizes only after tens of times more iterations than it needed to reach near-perfect overfitting on the training data. Focusing on this long delay from the standpoint of machine learning practitioners, our goal is to accelerate the generalization of a model under the grokking phenomenon. By regarding the series of gradients of a parameter over training iterations as a random signal over time, we can spectrally decompose the parameter trajectories under gradient descent into two components: a fast-varying component that yields overfitting and a slow-varying component that induces generalization. This analysis allows us to accelerate the grokking phenomenon by more than $50\times$ with only a few lines of code that amplify the slow-varying components of the gradients. The experiments show that our algorithm applies to diverse tasks involving images, languages, and graphs, making this peculiar artifact of sudden generalization practically available. Our code is available at \url{https://github.com/ironjr/grokfast}.
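To make the idea of amplifying the slow-varying gradient component concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the released implementation: the abstract does not specify the filter, so the sketch assumes a simple low-pass filter (an exponential moving average) and adds its output back to the raw gradient. The function name `amplify_slow_gradients` and the parameters `alpha` (filter momentum) and `lam` (amplification factor) are hypothetical names chosen for this example.

```python
def amplify_slow_gradients(grads, alpha=0.9, lam=2.0):
    """Filter a scalar gradient sequence, returning g_t + lam * ema_t.

    Assumption: the slow-varying component is estimated with an
    exponential moving average (a low-pass filter). `alpha` and `lam`
    are illustrative values, not taken from the abstract.
    """
    ema = None
    filtered = []
    for g in grads:
        # Low-pass estimate of the slow-varying part of the gradient signal.
        ema = g if ema is None else alpha * ema + (1.0 - alpha) * g
        # Add the amplified slow part back onto the raw gradient.
        filtered.append(g + lam * ema)
    return filtered


def mean(values):
    return sum(values) / len(values)


# Toy gradient signal: a small constant drift (slow component) plus a
# sign-alternating term (fast component).
slow_drift = 0.1
grads = [slow_drift + (1.0 if t % 2 == 0 else -1.0) for t in range(200)]
filtered = amplify_slow_gradients(grads)

# Averaging over an even number of late steps cancels the alternating
# term and exposes the slow component before and after filtering.
raw_mean = mean(grads[-100:])
filtered_mean = mean(filtered[-100:])
print(raw_mean, filtered_mean)
```

In this toy signal the constant drift is scaled by about `1 + lam`, while the alternating term is changed only slightly, which is the qualitative behavior the abstract describes. In an actual training loop, the same filter would be applied per parameter to the gradients between the backward pass and the optimizer step.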