Previous work used backward error analysis to derive ordinary differential equations (ODEs) that approximate the gradient descent trajectory. That analysis showed that a finite step size implicitly regularizes the solution, because the resulting ODEs contain terms that penalize the two-norm of the loss gradient. We prove that whether RMSProp and Adam exhibit a similar implicit regularization depends on their hyperparameters and on the training stage, and that a different "norm" is involved: the corresponding ODE terms either penalize the (perturbed) one-norm of the loss gradient or, on the contrary, hinder its decrease (the latter case being typical). We also conduct numerical experiments and discuss how these results can influence generalization.
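
For orientation, the gradient descent result referred to above is commonly stated as follows. This is a sketch in assumed notation, none of which appears in the abstract: $E$ is the loss, $\theta$ the parameters, and $h$ the step size. To first order in $h$, the iterates $\theta_{n+1} = \theta_n - h \nabla E(\theta_n)$ follow the gradient flow of a modified loss rather than of $E$ itself:

$$
\dot{\theta} = -\nabla \widetilde{E}(\theta), \qquad \widetilde{E}(\theta) = E(\theta) + \frac{h}{4} \, \lVert \nabla E(\theta) \rVert_2^2 .
$$

The extra term is the two-norm penalty on the loss gradient. By analogy, one plausible reading of the "(perturbed) one-norm" for RMSProp and Adam, stated here as an assumption with $\varepsilon$ taken to be the optimizer's numerical stability constant, is

$$
\lVert v \rVert_{1,\varepsilon} = \sum_{i} \sqrt{v_i^2 + \varepsilon},
$$

which reduces to the ordinary one-norm $\sum_i \lvert v_i \rvert$ when $\varepsilon = 0$.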