In previous literature, backward error analysis was used to find ordinary differential equations (ODEs) approximating the gradient descent trajectory. It was found that finite step sizes implicitly regularize solutions because terms appearing in the ODEs penalize the two-norm of the loss gradients. We prove that the existence of similar implicit regularization in RMSProp and Adam depends on their hyperparameters and the training stage, but with a different "norm" involved: the corresponding ODE terms either penalize the (perturbed) one-norm of the loss gradients or, on the contrary, hinder its decrease (the latter case being typical). We also conduct numerical experiments and discuss how the proven facts can influence generalization.
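To illustrate the gradient descent result recalled above, here is a minimal sketch, not taken from this work. It assumes the form of the modified equation usually quoted in the earlier literature: gradient descent with step size $h$ on a loss $E$ follows, up to higher-order terms, the gradient flow of $E + \frac{h}{4}\lVert\nabla E\rVert_2^2$. The quadratic loss, the function name `compare_flows`, and all parameter values are hypothetical choices made so that both ODEs can be solved in closed form. The sketch covers only the gradient descent case and does not reproduce the RMSProp or Adam results.

```python
import math


def compare_flows(a=1.0, h=0.1, steps=10, theta0=1.0):
    """Compare gradient descent on E(theta) = a * theta**2 / 2 with two ODEs.

    Returns the absolute gap between the gradient descent iterate and
    (1) the plain gradient flow, (2) the modified flow, both at time steps * h.
    """
    # Gradient descent: theta <- theta - h * E'(theta), with E'(theta) = a * theta.
    theta_gd = theta0
    for _ in range(steps):
        theta_gd -= h * a * theta_gd

    t = steps * h

    # Plain gradient flow theta' = -a * theta, solved exactly.
    theta_flow = theta0 * math.exp(-a * t)

    # Assumed modified loss: E + (h / 4) * E'(theta)**2.
    # Its gradient flow is theta' = -(a + h * a**2 / 2) * theta, solved exactly.
    theta_mod = theta0 * math.exp(-(a + 0.5 * h * a * a) * t)

    return abs(theta_gd - theta_flow), abs(theta_gd - theta_mod)


err_flow, err_mod = compare_flows()
print(f"plain flow error:    {err_flow:.2e}")
print(f"modified flow error: {err_mod:.2e}")
```

In this toy setting the modified flow tracks the discrete iterates more closely than the plain gradient flow, which is the sense in which the extra gradient-norm term describes the implicit effect of a finite step size.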