Although gradient descent with momentum is widely used in modern deep learning, a concrete understanding of its effects on the training trajectory remains elusive. In this work, we empirically show that momentum gradient descent with a large learning rate and learning rate warmup displays large catapults, driving the iterates towards flatter minima than those found by gradient descent. We then provide empirical evidence and theoretical intuition that the large catapult is caused by momentum "amplifying" the self-stabilization effect (Damian et al., 2023).
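For readers who want the optimizer setting in concrete form, the following is a minimal sketch of momentum gradient descent with learning rate warmup. It is an illustration under stated assumptions, not the authors' implementation: the linear shape of the warmup, the heavy-ball form of the momentum update, the helper names `warmup_lr` and `momentum_gd`, and the toy one-dimensional quadratic are all hypothetical choices made here. The toy run uses a small learning rate and is not meant to reproduce the catapult behavior described above.

```python
def warmup_lr(step, peak_lr, warmup_steps):
    """Linear warmup to peak_lr over warmup_steps (an assumed schedule)."""
    if warmup_steps <= 0:
        return peak_lr
    return peak_lr * min(1.0, (step + 1) / warmup_steps)


def momentum_gd(grad_fn, x0, peak_lr, beta=0.9, warmup_steps=100, num_steps=500):
    """Heavy-ball momentum (assumed form): v <- beta * v + grad; x <- x - lr * v."""
    x, v = float(x0), 0.0
    trajectory = [x]
    for step in range(num_steps):
        lr = warmup_lr(step, peak_lr, warmup_steps)
        v = beta * v + grad_fn(x)  # accumulate the gradient into the velocity
        x = x - lr * v             # step along the velocity
        trajectory.append(x)
    return trajectory


# Hypothetical toy problem: f(x) = 0.5 * sharpness * x**2, so grad f(x) = sharpness * x.
sharpness = 4.0
traj = momentum_gd(lambda x: sharpness * x, x0=1.0, peak_lr=0.1)
```

On a quadratic with curvature λ and a constant learning rate η, this update is stable when ηλ < 2(1 + β), compared with ηλ < 2 for plain gradient descent. The toy values above (ηλ = 0.4, β = 0.9) sit well inside that range, so the iterates simply converge to the minimum.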