In this work, we investigate the effect of momentum on the optimisation trajectory of gradient descent. We leverage a continuous-time approach in the analysis of momentum gradient descent with step size $\gamma$ and momentum parameter $\beta$ that allows us to identify an intrinsic quantity $\lambda = \frac{ \gamma }{ (1 - \beta)^2 }$ which uniquely defines the optimisation path and provides a simple acceleration rule. When training a $2$-layer diagonal linear network in an overparametrised regression setting, we characterise the recovered solution through an implicit regularisation problem. We then prove that small values of $\lambda$ help to recover sparse solutions. Finally, we give similar but weaker results for stochastic momentum gradient descent. We provide numerical experiments which support our claims.
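To make the quantity $\lambda = \frac{\gamma}{(1 - \beta)^2}$ concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the heavy-ball form of momentum gradient descent, $w_{k+1} = w_k - \gamma \nabla L(w_k) + \beta (w_k - w_{k-1})$, and a diagonal linear network whose predictor is the elementwise product $\theta = u \odot v$; neither convention is fixed by the abstract. The function names, the toy data, and the initialisation are hypothetical. The script runs two $(\gamma, \beta)$ pairs that share $\lambda = 1$ and prints the predictors they recover; it illustrates the setting and does not verify the continuous-time result.

```python
import math


def intrinsic_lambda(gamma, beta):
    """Return lambda = gamma / (1 - beta)^2 for step size gamma and momentum beta."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    return gamma / (1.0 - beta) ** 2


def diag_net_grad(w, xs, ys):
    """Gradient of the loss (1 / 2n) * sum_i (<theta, x_i> - y_i)^2 with theta = u * v.

    w packs both layers as [u_1..u_d, v_1..v_d] (an illustrative layout).
    """
    d = len(w) // 2
    u, v = w[:d], w[d:]
    grad_theta = [0.0] * d
    for x, y in zip(xs, ys):
        residual = sum(u[j] * v[j] * x[j] for j in range(d)) - y
        for j in range(d):
            grad_theta[j] += residual * x[j] / len(xs)
    # Chain rule through theta_j = u_j * v_j.
    grad_u = [grad_theta[j] * v[j] for j in range(d)]
    grad_v = [grad_theta[j] * u[j] for j in range(d)]
    return grad_u + grad_v


def momentum_gd(grad, w0, gamma, beta, steps):
    """Heavy-ball iteration: w_{k+1} = w_k - gamma * grad(w_k) + beta * (w_k - w_{k-1})."""
    w_prev, w = list(w0), list(w0)
    for _ in range(steps):
        g = grad(w)
        w_next = [wi - gamma * gi + beta * (wi - wp)
                  for wi, gi, wp in zip(w, g, w_prev)]
        w_prev, w = w, w_next
    return w


def predictor(w):
    """Recover theta = u * v from the packed parameter vector."""
    d = len(w) // 2
    return [w[j] * w[d + j] for j in range(d)]


# Toy overparametrised regression: one observation, two features (d > n).
xs, ys = [[1.0, 2.0]], [2.0]
w0 = [0.1, 0.1, 0.1, 0.1]  # small initialisation, chosen arbitrarily

# Two hypothetical (gamma, beta) pairs sharing the same lambda = 1.
for gamma, beta in [(0.01, 0.9), (0.04, 0.8)]:
    w = momentum_gd(lambda p: diag_net_grad(p, xs, ys), w0, gamma, beta, steps=2000)
    print(gamma, beta, intrinsic_lambda(gamma, beta), predictor(w))
```

Varying `gamma` and `beta` so that `intrinsic_lambda` changes is one way to explore, on this toy problem, how the recovered interpolator depends on $\lambda$.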