Grokking, the abrupt transition from memorization to generalization after extended training, suggests the presence of competing solution basins with distinct statistical properties. We study this phenomenon through the lens of Singular Learning Theory (SLT), a Bayesian framework that characterizes the geometry of the loss landscape via the local learning coefficient (LLC), a measure of the local degeneracy of the loss surface. SLT links lower-LLC basins to higher concentration of posterior mass and lower expected generalization error. Leveraging this theory, we interpret grokking in quadratic networks as a phase transition between competing near-zero-loss solution basins. Our contributions are two-fold: we derive closed-form expressions for the LLC of quadratic networks trained on modular arithmetic tasks and verify them empirically, and we provide empirical evidence that LLC trajectories offer a reliable tool for tracking generalization dynamics and interpreting phase transitions during training.
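For readers unfamiliar with the quantity, the following is a standard SLT characterization of the local learning coefficient (a sketch of the usual definition, not an equation taken from this work): near a local minimum $w^*$ of the population loss $L$, the LLC $\lambda$ governs how the volume of near-minimal-loss parameters scales, and equivalently appears in the asymptotic expansion of the local free energy.
\[
V(\varepsilon) \;=\; \int_{\{w \,:\, L(w) - L(w^*) < \varepsilon\}} \varphi(w)\, dw \;\sim\; c\,\varepsilon^{\lambda}\,(-\log \varepsilon)^{m-1} \quad (\varepsilon \to 0),
\qquad
\mathbb{E}[F_n] \;\approx\; n\,L(w^*) + \lambda \log n + (m-1)\log\log n + O(1),
\]
where $\varphi$ is a prior supported near $w^*$ and $m$ is the multiplicity. Lower $\lambda$ thus corresponds to a more degenerate (flatter, in the SLT sense) basin carrying greater posterior mass, which is the link the abstract invokes between LLC and generalization.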