Mini-batch SGD with momentum is a fundamental algorithm for learning large predictive models. In this paper we develop a new analytic framework to analyze noise-averaged properties of mini-batch SGD for linear models at constant learning rates, momenta, and batch sizes. Our key idea is to consider the dynamics of the second moments of model parameters for a special family of "Spectrally Expressible" approximations. This allows us to obtain an explicit expression for the generating function of the sequence of loss values. By analyzing this generating function, we find, in particular, that 1) the SGD dynamics exhibits several convergent and divergent regimes depending on the spectral distributions of the problem; 2) the convergent regimes admit explicit stability conditions, and explicit loss asymptotics in the case of power-law spectral distributions; 3) the optimal convergence rate can be achieved at negative momenta. We verify our theoretical predictions by extensive experiments with MNIST, CIFAR10, and synthetic problems, and find good quantitative agreement.
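To make the setting concrete, the following is a minimal sketch of mini-batch SGD with momentum on a synthetic linear least-squares problem, run at a constant learning rate, momentum, and batch size. It is an illustration only, not the paper's experimental code. Several choices are assumptions made here: the heavy-ball form of the update, Gaussian features with diagonal covariance, noiseless targets, the power-law exponent 1.5, and the function name `sgd_momentum_losses` with all of its parameter values.

```python
import numpy as np

def sgd_momentum_losses(lambdas, w_star, lr, momentum, batch_size, steps, seed=0):
    """Run mini-batch heavy-ball SGD on a linear model and return the loss at each step.

    Assumed setup (illustrative only): features x ~ N(0, diag(lambdas)),
    noiseless targets y = x . w_star, and population loss
    L(w) = 1/2 * sum_k lambdas[k] * (w[k] - w_star[k])**2.
    """
    rng = np.random.default_rng(seed)
    d = lambdas.size
    w = np.zeros(d)   # model parameters
    v = np.zeros(d)   # momentum (velocity) buffer
    losses = np.empty(steps)
    for t in range(steps):
        # Record the population loss before taking the step.
        losses[t] = 0.5 * np.sum(lambdas * (w - w_star) ** 2)
        # Draw a fresh mini-batch with covariance diag(lambdas).
        X = rng.standard_normal((batch_size, d)) * np.sqrt(lambdas)
        residual = X @ (w - w_star)           # predictions minus targets
        grad = X.T @ residual / batch_size    # mini-batch gradient of the squared loss
        # Heavy-ball update with constant learning rate and momentum.
        v = momentum * v - lr * grad
        w = w + v
    return losses

# Power-law spectrum lambda_k = k^(-1.5); the exponent is an arbitrary choice.
d = 50
lambdas = 1.0 / np.arange(1, d + 1) ** 1.5
w_star = np.ones(d)

# Compare positive, zero, and negative momentum at the same learning rate.
for beta in (0.5, 0.0, -0.3):
    losses = sgd_momentum_losses(lambdas, w_star, lr=0.2, momentum=beta,
                                 batch_size=10, steps=500)
    print(f"momentum={beta:+.1f}  initial loss={losses[0]:.4f}  final loss={losses[-1]:.6f}")
```

A single run like this shows one noisy trajectory, whereas the abstract concerns noise-averaged quantities. Averaging the returned loss sequences over many seeds would give a rough empirical counterpart.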