The training of modern machine learning models often consists of solving high-dimensional non-convex optimisation problems over large-scale data. In this context, momentum-based stochastic optimisation algorithms have become particularly widespread. The stochasticity arises from data subsampling, which reduces computational cost. Both momentum and stochasticity help the algorithm to converge globally. In this work, we propose and analyse a continuous-time model for stochastic gradient descent with momentum. This model is a piecewise-deterministic Markov process that represents the optimiser by an underdamped dynamical system and the data subsampling through stochastic switching. We investigate long-time limits, the subsampling-to-no-subsampling limit, and the momentum-to-no-momentum limit. We are particularly interested in the case where the momentum is reduced over time. Under convexity assumptions, we show that our dynamical system converges to the global minimiser when the momentum is reduced over time and the subsampling rate tends to infinity. We then propose a stable, symplectic discretisation scheme to construct an algorithm from our continuous-time dynamical system. In experiments, we study our scheme on convex and non-convex test problems; additionally, we train a convolutional neural network on an image classification problem. Our algorithm attains competitive results compared to stochastic gradient descent with momentum.
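To make the construction concrete, the following is a minimal sketch of the kind of dynamics described above, not the paper's actual algorithm: between the arrival times of a Poisson process with rate `lam` (the subsampling rate), an underdamped system is integrated with the gradient of the currently active subsample loss, and a symplectic-Euler-type step is used for the discretisation. The friction schedule `gamma`, the step size `h`, and the toy least-squares problem are illustrative assumptions, not the paper's notation.

```python
# Minimal sketch of a momentum PDMP optimiser, under assumptions: between
# switches the state follows the underdamped system
#   dx/dt = p,   dp/dt = -gamma(t) * p - grad f_i(x),
# the active subsample i switches at the arrival times of a Poisson process
# with rate lam, and each step is a symplectic-Euler-type update.
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem split into n_batches subsample losses f_i.
d, n_batches = 5, 10
A = [rng.standard_normal((20, d)) for _ in range(n_batches)]
b = [rng.standard_normal(20) for _ in range(n_batches)]

def grad_f(i, x):
    """Gradient of the i-th subsample loss f_i(x) = 0.5 * ||A_i x - b_i||^2."""
    return A[i].T @ (A[i] @ x - b[i])

def gamma(t, gamma0=1.0):
    """Increasing friction: damping grows, so momentum is reduced over time."""
    return gamma0 * (1.0 + t)

h, lam, T = 1e-3, 50.0, 10.0       # step size, switching rate, time horizon
x, p, t = np.zeros(d), np.zeros(d), 0.0
i = rng.integers(n_batches)        # currently active data subsample
next_switch = rng.exponential(1.0 / lam)

while t < T:
    if t >= next_switch:           # stochastic switching of the subsample
        i = rng.integers(n_batches)
        next_switch += rng.exponential(1.0 / lam)
    # Symplectic-Euler-type step: momentum first (with an exact damping
    # factor for stability), then position with the updated momentum.
    p = np.exp(-gamma(t) * h) * p - h * grad_f(i, x)
    x = x + h * p
    t += h

print("final iterate:", x)
```

On this convex toy problem the iterate settles near the least-squares solution; the growing friction mimics the momentum-to-no-momentum regime, and increasing `lam` mimics the subsampling-to-no-subsampling limit.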