Noise-Driven Escape from Metastable Phases Explains Grokking in Deep Neural Networks

Deep neural networks (DNNs) exhibit first-order phase transitions under variations of the L2 regularization strength, with each transition marking the onset of a new learnable feature. Below a critical regularization strength, all features are in principle learnable, but coexisting metastable states, separated by energy barriers, can trap the network and impede convergence. A strength of DNNs is their ability to generalize, yet many open questions remain, among them the origin of so-called grokking: the abrupt, delayed onset of generalization after prolonged apparent overfitting. We show for linear DNNs that grokking is consistent with hysteresis in first-order L2 phase transitions. Using L2 regularization to engineer deliberate trapping, we demonstrate that a model in a low-accuracy metastable state escapes only when SGD noise drives it across an energy barrier, with escape times following Arrhenius scaling. By deliberately trapping models in metastable phases, we reproduce grokking-like delayed convergence with escape times spanning two orders of magnitude. Using sparse sub-sampling, we also reproduce the canonical grokking curve, in which test error eventually approaches the final training error. Our work suggests that, because the number of metastable states equals the number of learnable features (one per singular value of the data covariance), the potential for hysteresis grows naturally with task complexity. We provide evidence that the same mechanism likely operates in general nonlinear DNNs. Our results provide routes toward more efficient learning schemes.
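Arrhenius scaling means that the mean escape time grows exponentially with the ratio of barrier height to noise strength, tau ∝ exp(ΔE / T). The sketch below illustrates this on a toy problem only: a single coordinate in a one-dimensional double-well potential, driven by Gaussian noise. It is not the DNN loss or the training setup studied in this work. The potential U(x) = x^4/4 - x^2/2, the temperature T standing in for the SGD noise strength, and the names grad_U and mean_escape_time are all assumptions made for illustration.

```python
import numpy as np

def grad_U(x):
    # Gradient of the assumed toy potential U(x) = x**4/4 - x**2/2.
    # Minima sit at x = -1 and x = +1; the barrier top is at x = 0,
    # so the barrier height is U(0) - U(-1) = 1/4.
    return x**3 - x

def mean_escape_time(T, n_walkers=400, dt=0.01, max_steps=500_000, seed=0):
    """Mean first time at which walkers started at x = -1 cross x = 0.

    Uses Euler-Maruyama steps of overdamped Langevin dynamics:
        x <- x - U'(x) * dt + sqrt(2 * T * dt) * N(0, 1)
    """
    rng = np.random.default_rng(seed)
    x = np.full(n_walkers, -1.0)            # all walkers start in the left well
    t_escape = np.full(n_walkers, np.nan)   # NaN marks "not escaped yet"
    for step in range(1, max_steps + 1):
        noise = rng.standard_normal(n_walkers)
        x += -grad_U(x) * dt + np.sqrt(2.0 * T * dt) * noise
        newly_escaped = np.isnan(t_escape) & (x > 0.0)
        t_escape[newly_escaped] = step * dt
        if not np.isnan(t_escape).any():
            break
    # Walkers that never escaped within max_steps are assigned the cap.
    t_escape = np.where(np.isnan(t_escape), max_steps * dt, t_escape)
    return t_escape.mean()

# Sweep the inverse noise strength 1/T and fit log(tau) against 1/T.
inv_T = np.array([8.0, 10.0, 12.0, 14.0])
taus = np.array([mean_escape_time(1.0 / b) for b in inv_T])
slope, intercept = np.polyfit(inv_T, np.log(taus), 1)

for b, tau in zip(inv_T, taus):
    print(f"1/T = {b:4.1f}   mean escape time = {tau:8.2f}")
print(f"fitted slope of log(tau) vs 1/T: {slope:.3f} (barrier height is 0.25)")
```

If the toy dynamics follow Arrhenius scaling, log(tau) is roughly linear in 1/T and the fitted slope should land near the barrier height of 0.25. Finite-temperature corrections, the time step, and sampling noise all shift it somewhat. Lower noise means exponentially longer trapping in the metastable well, which is the qualitative picture the abstract proposes for delayed convergence.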
