Grokking, a phenomenon in which generalization occurs long after a model has overfitted its training data, has recently attracted considerable attention. We try to understand this seemingly strange phenomenon through the robustness of the neural network. From a robustness perspective, we show that the popular $l_2$ weight norm (metric) of the neural network is actually a sufficient condition for grokking. Based on these observations, we propose perturbation-based methods to speed up generalization. In addition, we examine the standard training process on the modulo addition dataset and find that, before grokking, the model hardly learns other basic group operations, for example the commutative law. Interestingly, the speed-up in generalization under our proposed method can be explained by the model learning the commutative law, which is a necessary condition for grokking on the test dataset. We also find empirically that the $l_2$ norm does not track grokking on the test data in a timely way. We therefore propose new metrics based on robustness and information theory, and find that they correlate well with the grokking phenomenon and may be used to predict it.
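To make the quantities named in the abstract concrete, the following is a minimal sketch of how one might build a modulo addition dataset, check how often a predictor respects the commutative law, and compute an $l_2$ weight norm. It is an illustration under stated assumptions, not the authors' implementation: the function names (`modular_addition_table`, `commutativity_agreement`, `l2_weight_norm`), the modulus `p = 7`, and the two toy predictors are all hypothetical, and the actual metrics proposed in the paper are not reproduced here.

```python
from itertools import product
from math import sqrt


def modular_addition_table(p):
    """Return every (a, b, (a + b) mod p) triple for a, b in range(p)."""
    return [(a, b, (a + b) % p) for a, b in product(range(p), repeat=2)]


def commutativity_agreement(predict, p):
    """Fraction of pairs a < b for which predict(a, b) == predict(b, a).

    `predict` is any callable mapping two integers to a predicted label;
    in practice it would wrap a trained model (assumption).
    """
    pairs = [(a, b) for a in range(p) for b in range(a + 1, p)]
    agree = sum(predict(a, b) == predict(b, a) for a, b in pairs)
    return agree / len(pairs)


def l2_weight_norm(weights):
    """l2 norm over all parameters, given as a list of flat lists of floats."""
    return sqrt(sum(w * w for layer in weights for w in layer))


p = 7  # hypothetical small modulus, chosen only for illustration

# A predictor that has learned modular addition exactly.
def exact(a, b):
    return (a + b) % p

# A hypothetical order-dependent predictor, standing in for a model
# that has not learned the commutative law.
def lopsided(a, b):
    return (a + 2 * b) % p

table = modular_addition_table(p)
exact_score = commutativity_agreement(exact, p)
lopsided_score = commutativity_agreement(lopsided, p)
toy_norm = l2_weight_norm([[3.0, 0.0], [4.0]])
```

Tracking a score like `commutativity_agreement` alongside test accuracy during training is one simple way to observe whether commutativity is learned before or at the point of grokking.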