In previous literature, backward error analysis was used to find ordinary differential equations (ODEs) approximating the gradient descent trajectory. It was found that finite step sizes implicitly regularize solutions because terms appearing in the ODEs penalize the two-norm of the loss gradients. We prove that the existence of similar implicit regularization in RMSProp and Adam depends on their hyperparameters and the training stage, but with a different "norm" involved: the corresponding ODE terms either penalize the (perturbed) one-norm of the loss gradients or, conversely, impede its reduction (the latter case being typical). We also conduct numerical experiments and discuss how the proven facts can influence generalization.
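To make the gradient descent statement concrete, here is a minimal numerical sketch. It assumes the form of the result commonly cited in the backward error analysis literature: gradient descent with step size h follows the gradient flow of the modified loss L(θ) + (h/4)·‖∇L(θ)‖₂² more closely than the gradient flow of L itself. The quadratic loss L(θ) = a·θ²/2, the function names, and the parameter values are illustrative assumptions, not taken from the paper. The RMSProp and Adam terms are not reproduced, since the abstract does not state their exact form.

```python
import math


def gd_iterate(theta0, a, h, k):
    """Run k gradient descent steps on L(theta) = a * theta**2 / 2."""
    theta = theta0
    for _ in range(k):
        theta -= h * a * theta  # grad L(theta) = a * theta
    return theta


def gradient_flow(theta0, a, t):
    """Exact solution at time t of d(theta)/dt = -grad L(theta)."""
    return theta0 * math.exp(-a * t)


def modified_flow(theta0, a, h, t):
    """Exact solution at time t of the flow on L + (h/4) * |grad L|^2.

    For the quadratic loss, the modified gradient is a * (1 + h*a/2) * theta.
    """
    return theta0 * math.exp(-a * (1.0 + h * a / 2.0) * t)


# Illustrative values (assumptions): curvature a, step size h, k steps.
theta0, a, h, k = 1.0, 1.0, 0.1, 10
t = k * h  # continuous time matching k discrete steps

gd = gd_iterate(theta0, a, h, k)
err_plain = abs(gd - gradient_flow(theta0, a, t))
err_modified = abs(gd - modified_flow(theta0, a, h, t))

print(f"GD iterate:             {gd:.6f}")
print(f"error vs gradient flow: {err_plain:.6f}")
print(f"error vs modified flow: {err_modified:.6f}")
```

In this toy setting the modified flow tracks the discrete iterates more closely than the unmodified flow, which is the sense in which the extra gradient-norm term describes an implicit effect of the finite step size.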