In some settings, neural networks exhibit a phenomenon known as grokking, where they achieve perfect or near-perfect accuracy on the validation set long after the same performance has been achieved on the training set. In this paper, we discover that grokking is not limited to neural networks but also occurs in other settings such as Gaussian process (GP) classification, GP regression and linear regression. We also uncover a mechanism for inducing grokking on algorithmic datasets via the addition of dimensions containing spurious information. The presence of the phenomenon in non-neural architectures provides evidence that grokking is not specific to SGD or weight norm regularisation. Instead, grokking may be possible in any setting where solution search is guided by complexity and error. Based on this insight and further trends we observe in the training trajectories of a Bayesian neural network (BNN) and a GP regression model, we make progress towards a more general theory of grokking. Specifically, we hypothesise that the phenomenon is governed by the accessibility of certain regions in the error and complexity landscapes.
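As a rough illustration of the mechanism mentioned above (adding dimensions containing spurious information), the following minimal sketch appends extra input columns that carry no signal about the targets. The abstract does not specify how the spurious dimensions are constructed, so the use of standard Gaussian noise, the helper name `add_spurious_dimensions`, and the array shapes are all assumptions made here for illustration only.

```python
import numpy as np


def add_spurious_dimensions(X, num_spurious, rng):
    """Append `num_spurious` noise columns to the inputs X.

    Hypothetical helper: the Gaussian noise used here is an assumption,
    not necessarily the construction used in the paper.
    """
    # One independent noise value per example and per added dimension.
    noise = rng.standard_normal((X.shape[0], num_spurious))
    # The original features stay in the leading columns, unchanged.
    return np.concatenate([X, noise], axis=1)


rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))             # toy inputs: 8 examples, 3 features
X_aug = add_spurious_dimensions(X, 5, rng)  # 8 examples, 3 + 5 features
```

A model trained on `X_aug` can fit the training set using the uninformative columns, which is one plausible way for a simpler, generalising solution to be reached only later in training.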