Grokking is the phenomenon whereby, unlike the training performance, which peaks early in training, the test (generalization) performance of a model stagnates for arbitrarily many epochs and then suddenly jumps, usually to near-perfect levels. In practice, it is desirable to shorten such plateaus, that is, to make the learning process "grok" faster. In this work, we provide new insights into grokking. First, we show both empirically and theoretically that grokking can be induced by asymmetric speeds of (stochastic) gradient descent along different principal (i.e., singular) directions of the gradients. We then propose a simple modification that normalizes the gradients so that the dynamics along all principal directions evolve at exactly the same speed. We establish that this modified method, which we call egalitarian gradient descent (EGD) and which can be seen as a carefully modified form of natural gradient descent, groks much faster; in some cases, the stagnation is removed entirely. Finally, we show empirically that on classical arithmetic problems such as modular addition and the sparse parity problem, for which this stagnation has been widely observed and intensively studied, our proposed method eliminates the plateaus.
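To make the idea of equalizing speeds along principal directions concrete, here is a minimal sketch of one plausible reading of the abstract: take the SVD of a matrix-shaped gradient and replace all singular values with a common value before stepping, so every singular direction moves at the same rate. The function name `egalitarian_step` and the exact normalization are assumptions for illustration, not the authors' stated algorithm.

```python
import numpy as np

def egalitarian_step(W, G, lr=1e-2, eps=1e-8):
    """Hypothetical EGD-style update: give every principal (singular)
    direction of the gradient the same speed. Illustrative sketch only."""
    # SVD of the matrix-shaped gradient: G = U @ diag(s) @ Vt.
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    # Keep the singular directions but assign them a common unit scale;
    # near-zero singular values stay zero so pure noise is not amplified.
    G_eq = (U * (s > eps)) @ Vt
    return W - lr * G_eq

# Toy usage: fit a linear layer W to random targets with the sketch above.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))
W_true = rng.normal(size=(8, 4))
Y = X @ W_true
W = np.zeros((8, 4))
for _ in range(200):
    G = X.T @ (X @ W - Y) / len(X)   # gradient of 0.5 * mean squared error
    W = egalitarian_step(W, G, lr=0.1)
```

Under this interpretation, the update direction depends only on the singular vectors of the gradient, which is one way to realize "dynamics along all principal directions evolve at exactly the same speed."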