In this paper, we investigate the convergence properties of stochastic gradient descent (SGD) and its variants, particularly for training neural networks built from nonsmooth activation functions. We develop a novel framework that assigns different timescales to the stepsizes used for updating the momentum terms and the variables. Under mild conditions, we prove the global convergence of the proposed framework in both the single-timescale and two-timescale cases. We show that the framework encompasses a wide range of well-known SGD-type methods, including heavy-ball SGD, SignSGD, Lion, normalized SGD, and clipped SGD. Furthermore, when the objective function takes a finite-sum form, we establish convergence properties for these SGD-type methods based on the proposed framework. In particular, we prove that, under mild assumptions, these SGD-type methods find Clarke stationary points of the objective function with randomly chosen stepsizes and initial points. Preliminary numerical experiments demonstrate the high efficiency of the analyzed SGD-type methods.
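To make the idea of separate stepsizes for the momentum term and the variables concrete, the following is a minimal sketch and not the paper's exact scheme. It assumes a momentum update of the form m ← m + θ(g − m) followed by a variable update x ← x − η·d(m), where the direction map d distinguishes heavy-ball, sign-based, normalized, and clipped variants. The names `sgd_type_step` and `run`, the test objective ‖x‖₁, and the stepsize schedules are all hypothetical choices made for illustration; Lion is left out for brevity.

```python
import numpy as np

def sgd_type_step(x, m, g, eta, theta, variant="heavy_ball", clip=1.0):
    """One illustrative momentum-based SGD-type step (assumed form)."""
    # Momentum update with its own stepsize theta.
    m_new = m + theta * (g - m)
    norm = np.linalg.norm(m_new)
    if variant == "heavy_ball":
        d = m_new
    elif variant == "sign":
        d = np.sign(m_new)
    elif variant == "normalized":
        d = m_new / (norm + 1e-12)
    elif variant == "clipped":
        # Scale down only when the momentum norm exceeds the threshold.
        d = m_new * min(1.0, clip / (norm + 1e-12))
    else:
        raise ValueError(f"unknown variant: {variant}")
    # Variable update with stepsize eta.
    return x - eta * d, m_new

def run(variant, steps=2000, seed=0):
    """Minimize the nonsmooth function ||x||_1 from noisy subgradients."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=5)          # randomly chosen initial point
    m = np.zeros_like(x)
    for k in range(1, steps + 1):
        # sign(x) is a subgradient of ||x||_1; the noise is an assumption.
        g = np.sign(x) + 0.1 * rng.normal(size=x.shape)
        eta = 0.1 / k ** 0.75       # assumed schedule for the variables
        theta = 1.0 / k ** 0.5      # assumed schedule for the momentum
        x, m = sgd_type_step(x, m, g, eta, theta, variant)
    return float(np.abs(x).sum())

if __name__ == "__main__":
    for v in ("heavy_ball", "sign", "normalized", "clipped"):
        print(v, run(v))
```

In this sketch the ratio η/θ tends to zero, which is one common reading of a two-timescale setting; choosing θ proportional to η would instead give a single-timescale variant. Both readings are assumptions here, since the abstract does not state the precise conditions.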