Neural networks trained to solve modular arithmetic tasks exhibit grokking, a phenomenon in which test accuracy starts improving long after the model reaches 100% training accuracy. Grokking is often taken as an example of "emergence", where a model's ability manifests sharply through a phase transition. In this work, we show that grokking is specific neither to neural networks nor to gradient descent-based optimization. Specifically, the phenomenon occurs when learning modular arithmetic with Recursive Feature Machines (RFM), an iterative algorithm that uses the Average Gradient Outer Product (AGOP) to enable task-specific feature learning with general machine learning models. When used in conjunction with kernel machines, iterating RFM produces a rapid transition from random, near-zero test accuracy to perfect test accuracy. This transition cannot be predicted from the training loss, which is identically zero, nor from the test loss, which remains constant over the initial iterations. Instead, as we show, the transition is completely determined by feature learning: RFM gradually learns block-circulant features to solve modular arithmetic. Paralleling the results for RFM, we show that neural networks that solve modular arithmetic also learn block-circulant features. Furthermore, we present theoretical evidence that RFM uses such block-circulant features to implement the Fourier Multiplication Algorithm, which prior work posited as the generalizing solution that neural networks learn on these tasks. Our results demonstrate that emergence can arise purely from learning task-relevant features and is specific neither to neural architectures nor to gradient descent-based optimization methods. Our work also provides further evidence for AGOP as a key mechanism of feature learning in neural networks.
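For concreteness, the RFM loop described above can be sketched in a few lines: fit a kernel ridge regressor under the current feature matrix M, then replace M with the AGOP of the fitted predictor. The sketch below is a minimal illustration, not the paper's implementation; the kernel choice (Gaussian with a learned Mahalanobis metric), the trace normalization, and all hyperparameter values are illustrative assumptions.

```python
import numpy as np

def mahalanobis_gauss_kernel(X, Z, M, bandwidth):
    """K[i, j] = exp(-(x_i - z_j)^T M (x_i - z_j) / (2 * bandwidth**2))."""
    XM, ZM = X @ M, Z @ M
    sq = (np.sum(XM * X, axis=1)[:, None]
          + np.sum(ZM * Z, axis=1)[None, :]
          - 2.0 * XM @ Z.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def rfm(X, Y, iters=5, bandwidth=2.0, ridge=1e-6):
    """Recursive Feature Machine: alternate kernel ridge fits with AGOP updates."""
    n, d = X.shape
    if Y.ndim == 1:
        Y = Y[:, None]                       # allow (n,) or one-hot (n, c) targets
    M = np.eye(d)                            # start from the identity metric
    for _ in range(iters):
        K = mahalanobis_gauss_kernel(X, X, M, bandwidth)
        alpha = np.linalg.solve(K + ridge * np.eye(n), Y)   # (n, c) dual weights
        G = np.zeros((d, d))
        for i in range(n):                   # AGOP: average grad outer products
            W = K[i][:, None] * alpha        # (n, c): K(x_i, x_j) * alpha[j, c]
            # grad_x f_c(x_i) = -(1/h^2) * M @ sum_j W[j, c] * (x_i - x_j)
            S = np.outer(X[i], W.sum(axis=0)) - X.T @ W     # (d, c)
            grads = -(M @ S) / bandwidth**2
            G += grads @ grads.T             # sum over output channels
        M = G / n                            # the AGOP becomes the new metric
        M /= np.trace(M) / d                 # rescale (a practical choice, not the paper's)
    # refit under the final metric so alpha matches the returned M
    K = mahalanobis_gauss_kernel(X, X, M, bandwidth)
    alpha = np.linalg.solve(K + ridge * np.eye(n), Y)
    return M, alpha
```

Test predictions would use mahalanobis_gauss_kernel(X_test, X_train, M, bandwidth) @ alpha; tracking test accuracy across iterations of this loop is where the abrupt transition appears, even though in the interpolating regime (ridge near zero) the training loss is already zero from the first iteration.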
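The Fourier Multiplication Algorithm referenced above also admits a short self-contained check. The toy snippet below is an illustration of the underlying identity, not the paper's code (the name fourier_mod_add is ours): the sum of cos(2πkm/p) over all frequencies k in {0, ..., p-1} equals p when m ≡ 0 (mod p) and 0 otherwise, so scoring each candidate class c by the sum of cos(2πk(a+b-c)/p) and taking the argmax recovers (a+b) mod p. Block-circulant features support exactly this kind of frequency-wise computation, since circulant blocks are diagonalized by the discrete Fourier transform.

```python
import numpy as np

p = 11                       # a small prime modulus for the check
k = np.arange(p)             # all Fourier frequencies mod p

def fourier_mod_add(a, b):
    # logit(c) = sum_k cos(2*pi*k*(a + b - c)/p):
    # equals p iff c == (a + b) mod p, and 0 otherwise.
    logits = np.array([np.cos(2.0 * np.pi * k * (a + b - c) / p).sum()
                       for c in range(p)])
    return int(np.argmax(logits))

# exhaustive check that the Fourier computation matches modular addition
assert all(fourier_mod_add(a, b) == (a + b) % p
           for a in range(p) for b in range(p))
```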