In this paper, we investigate the convergence properties of the stochastic gradient descent (SGD) method and its variants, particularly for training neural networks built from nonsmooth activation functions. We develop a novel framework that assigns different timescales to the stepsizes used for updating the momentum terms and the variables. Under mild conditions, we prove the global convergence of the proposed framework in both the single-timescale and two-timescale cases. We show that the framework encompasses a wide range of well-known SGD-type methods, including heavy-ball SGD, SignSGD, Lion, normalized SGD, and clipped SGD. Furthermore, when the objective function takes a finite-sum form, we establish convergence properties for these SGD-type methods based on the proposed framework. In particular, we prove that, under mild assumptions, these SGD-type methods find Clarke stationary points of the objective function with randomly chosen stepsizes and initial points. Preliminary numerical experiments demonstrate the high efficiency of the analyzed SGD-type methods.
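To make the idea of separate stepsizes for the momentum terms and the variables concrete, the following is a minimal sketch and not the exact scheme analyzed in the paper. It assumes a generic update of the form m_{k+1} = m_k + θ_k (g_k − m_k), x_{k+1} = x_k − η_k φ(m_{k+1}), where g_k is a noisy subgradient and φ is a direction map (identity for a heavy-ball-style update, sign for a sign-based update, normalization, or clipping). The function names, the stepsize schedules η_k = η_0 / k^0.75 and θ_k = θ_0 / k^0.5, and the nonsmooth test objective f(x) = Σ|x_i| are all illustrative assumptions.

```python
import math
import random


def identity(v):
    """Heavy-ball-style direction: use the momentum as is."""
    return list(v)


def sign(v):
    """Sign-based direction: elementwise sign of the momentum."""
    return [(x > 0) - (x < 0) for x in v]


def normalize(v, eps=1e-12):
    """Normalized direction: momentum divided by its Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / (n + eps) for x in v]


def clip(v, c=1.0):
    """Clipped direction: rescale the momentum so its norm is at most c."""
    n = math.sqrt(sum(x * x for x in v))
    s = min(1.0, c / n) if n > 0 else 1.0
    return [s * x for x in v]


def noisy_subgrad_l1(x, noise, rng):
    """One Clarke subgradient of f(x) = sum |x_i| plus zero-mean Gaussian noise."""
    return [((xi > 0) - (xi < 0)) + noise * rng.gauss(0.0, 1.0) for xi in x]


def momentum_sgd(x0, direction, steps=2000, eta0=0.5, theta0=0.5,
                 noise=0.1, seed=0):
    """Hypothetical two-stepsize momentum SGD on the nonsmooth f(x) = sum |x_i|."""
    rng = random.Random(seed)
    x = list(x0)
    m = [0.0] * len(x)
    for k in range(1, steps + 1):
        eta = eta0 / k ** 0.75     # stepsize for the variables (assumed schedule)
        theta = theta0 / k ** 0.5  # stepsize for the momentum (assumed schedule)
        g = noisy_subgrad_l1(x, noise, rng)
        # Momentum update: move m toward the current noisy subgradient.
        m = [mi + theta * (gi - mi) for mi, gi in zip(m, g)]
        # Variable update: step along the chosen map of the momentum.
        d = direction(m)
        x = [xi - eta * di for xi, di in zip(x, d)]
    return x


if __name__ == "__main__":
    for name, phi in [("heavy-ball", identity), ("sign", sign),
                      ("normalized", normalize), ("clipped", clip)]:
        print(name, momentum_sgd([1.0, -0.5], phi))
```

In this sketch the momentum stepsize decays more slowly than the variable stepsize, so the two sequences evolve on different timescales; setting both exponents equal would give a single-timescale variant. The only Clarke stationary point of the test objective is the origin, so each variant should end near it.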