Grokking as the Transition from Lazy to Rich Training Dynamics

We propose that the grokking phenomenon, in which the train loss of a neural network decreases much earlier than its test loss, can arise when the network transitions from lazy training dynamics to a rich, feature-learning regime. To illustrate this mechanism, we study a simple setting: vanilla gradient descent on a polynomial regression problem with a two-layer neural network, which exhibits grokking without regularization in a way that existing theories cannot explain. We identify sufficient statistics for the test loss of such a network. Tracking them over training reveals that grokking arises in this setting when the network first attempts to fit a kernel regression solution with its initial features, and only later, after the train loss is already low, learns features and identifies a generalizing solution. We find that the key determinants of grokking are the rate of feature learning, which can be controlled precisely by parameters that scale the network output, and the alignment of the initial features with the target function $y(x)$. We argue that this delayed generalization arises when (1) the top eigenvectors of the initial neural tangent kernel are misaligned with the task labels $y(x)$; (2) the dataset is large enough that the network can eventually generalize, but not so large that train loss perfectly tracks test loss at all epochs; and (3) the network begins training in the lazy regime and so does not learn features immediately. We conclude with evidence that this transition from lazy training (linear model) to rich training (feature learning) can control grokking in more general settings, such as MNIST, one-layer Transformers, and student-teacher networks.
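The two quantities the abstract singles out, an output scale that sets the rate of feature learning and the alignment of the initial neural tangent kernel with the labels, can be made concrete with a minimal NumPy sketch. Everything specific here is an assumption for illustration and not taken from the paper: the tanh activation, the $1/\sqrt{\text{width}}$ readout scaling, the quadratic target, centering the output by subtracting the network at initialization, and the particular alignment score $y^\top K y / (\lVert K \rVert_F \lVert y \rVert^2)$.

```python
import numpy as np


def init_params(d, width, rng):
    """Hypothetical two-layer net: f(x) = a . tanh(W x) / sqrt(width)."""
    W = rng.standard_normal((width, d))
    a = rng.standard_normal(width)
    return W, a


def net(params, X):
    """Unscaled network output for inputs X of shape (n, d)."""
    W, a = params
    return np.tanh(X @ W.T) @ a / np.sqrt(W.shape[0])


def scaled_output(params, params0, X, alpha):
    """Centered output multiplied by an output scale alpha (assumed form).

    The output is exactly zero at initialization. A larger alpha means the
    parameters need to move less to fit the data, which is the usual way of
    pushing a network toward the lazy (linearized) regime.
    """
    return alpha * (net(params, X) - net(params0, X))


def ntk(params, X):
    """Empirical neural tangent kernel of `net` on X, shape (n, n)."""
    W, a = params
    width = W.shape[0]
    act = np.tanh(X @ W.T)            # (n, width)
    dact = 1.0 - act ** 2             # derivative of tanh
    # Contribution of the readout weights a.
    K_a = act @ act.T / width
    # Contribution of the first-layer weights W.
    G = dact * a                      # (n, width), column j scaled by a_j
    K_W = (G @ G.T) * (X @ X.T) / width
    return K_a + K_W


def kernel_target_alignment(K, y):
    """Alignment y^T K y / (||K||_F ||y||^2); lies in [0, 1] for PSD K."""
    return float(y @ K @ y / (np.linalg.norm(K) * (y @ y)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n, width = 5, 50, 256
    X = rng.standard_normal((n, d)) / np.sqrt(d)
    beta = rng.standard_normal(d)
    y = (X @ beta) ** 2               # hypothetical polynomial target

    params0 = init_params(d, width, rng)
    K0 = ntk(params0, X)
    print("initial NTK-label alignment:", kernel_target_alignment(K0, y))
```

In this sketch, a low alignment score together with a large `alpha` corresponds to conditions (1) and (3) above: the initial kernel is poorly matched to the labels, and the network starts out close to a linear model in its initial features.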

