Recent research on the grokking phenomenon has illuminated the intricacies of neural networks' training dynamics and their generalization behaviors. Grokking refers to a sharp rise in a network's generalization accuracy on the test set that occurs long after an extended overfitting phase, during which the network perfectly fits the training set. While existing research primarily focuses on shallow networks such as 2-layer MLPs and 1-layer Transformers, we explore grokking in deep networks (e.g., a 12-layer MLP). We empirically replicate the phenomenon and find that deep neural networks can be more susceptible to grokking than their shallower counterparts. Meanwhile, when increasing the depth of the MLP model, we observe an intriguing multi-stage generalization phenomenon in which the test accuracy exhibits a secondary surge, which is scarcely seen in shallow models. We further uncover compelling correspondences between the decrease in feature ranks and the phase transition from the overfitting stage to the generalization stage during grokking. Additionally, we find that the multi-stage generalization phenomenon often aligns with a double-descent pattern in feature ranks. These observations suggest that internal feature rank could serve as a more promising indicator of the model's generalization behavior than the weight norm. We believe our work is the first to investigate grokking in deep neural networks and to examine the relationship between feature rank and generalization performance.
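The abstract treats the rank of a layer's internal features as a signal to track during training, but it does not state how that rank is computed. As a minimal sketch under stated assumptions, the snippet below uses one common definition: the number of singular values of an activation matrix that exceed a relative threshold. The function name `feature_rank`, the threshold `rel_tol`, and the synthetic activation matrices are illustrative assumptions, not the authors' procedure.

```python
import numpy as np


def feature_rank(features: np.ndarray, rel_tol: float = 1e-3) -> int:
    """Numerical rank of a (num_samples, feature_dim) activation matrix.

    Assumption: rank is the count of singular values larger than
    rel_tol times the largest singular value. The paper may define
    feature rank differently.
    """
    singular_values = np.linalg.svd(features, compute_uv=False)
    # An empty or all-zero matrix has rank 0.
    if singular_values.size == 0 or singular_values[0] == 0.0:
        return 0
    return int(np.sum(singular_values > rel_tol * singular_values[0]))


rng = np.random.default_rng(0)

# Synthetic stand-ins for one layer's activations (not real training data):
# 256 samples with 64 feature dimensions.
low_rank = rng.standard_normal((256, 5)) @ rng.standard_normal((5, 64))
full_rank = rng.standard_normal((256, 64))

print(feature_rank(low_rank), feature_rank(full_rank))
```

In practice one would presumably evaluate such a function on each layer's activations at regular training checkpoints and plot the result next to test accuracy, to check whether drops in rank coincide with the transition from overfitting to generalization.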