In this paper, we first explain the common occurrence of spikes in the training loss when neural networks are trained with stochastic gradient descent (SGD). We provide evidence that these spikes are "catapults", an optimization phenomenon originally observed in gradient descent (GD) with large learning rates in [Lewkowycz et al. 2020]. We empirically show that, for both GD and SGD, these catapults occur in a low-dimensional subspace spanned by the top eigenvectors of the tangent kernel. Second, we posit that catapults improve generalization by promoting feature learning: they increase alignment with the Average Gradient Outer Product (AGOP) of the true predictor. Furthermore, we demonstrate that a smaller batch size in SGD induces more catapults, thereby improving AGOP alignment and test performance.
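To make the notion of AGOP alignment concrete, the following is a minimal illustrative sketch, not the implementation used in the paper. It assumes the usual definition of the AGOP for a scalar predictor f over inputs x_1, ..., x_n, namely the average of the outer products of the input gradients of f, and it assumes that alignment is measured as the cosine similarity between two AGOP matrices under the Frobenius inner product. The target function, the stand-in model, and all function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def agop(grad_fn, X):
    """Average gradient outer product of a scalar predictor over the rows of X.

    grad_fn maps an input vector of shape (d,) to its input gradient of shape (d,).
    Assumed definition: (1/n) * sum_i grad f(x_i) grad f(x_i)^T.
    """
    G = np.stack([grad_fn(x) for x in X])  # shape (n, d)
    return G.T @ G / len(X)                # shape (d, d)

def agop_alignment(A, B):
    """Cosine similarity between two matrices (assumed alignment measure)."""
    return float(np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B)))

# Hypothetical "true predictor": f*(x) = x[0] * x[1], which depends on two coordinates only.
def grad_true(x):
    g = np.zeros_like(x)
    g[0], g[1] = x[1], x[0]
    return g

# Hypothetical stand-in for a trained model: f(x) = x[0] * x[1] + 0.1 * x[2].
def grad_model(x):
    g = grad_true(x)
    g[2] = 0.1
    return g

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))

A_true = agop(grad_true, X)
A_model = agop(grad_model, X)
align = agop_alignment(A_model, A_true)
print(f"AGOP alignment: {align:.4f}")
```

In this toy setting the AGOP of the target is supported on the first two coordinates, so a model whose gradients concentrate on the same coordinates attains an alignment close to 1. For a neural network, the gradients would instead be obtained by automatic differentiation with respect to the inputs.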