In this study, we revisit the convergence of AdaGrad with momentum (covering AdaGrad as a special case) on non-convex smooth optimization problems. We consider a general noise model in which the noise magnitude is controlled by the function value gap together with the gradient magnitude. This model encompasses a broad range of noise conditions, including bounded noise, sub-Gaussian noise, affine variance noise, and expected smoothness, and it has been shown to be more realistic in many practical applications. Our analysis yields a probabilistic convergence rate which, under this general noise model, reaches $\tilde{\mathcal{O}}(1/\sqrt{T})$, where $T$ denotes the total number of iterations. This rate does not rely on prior knowledge of problem parameters and accelerates to $\tilde{\mathcal{O}}(1/T)$ when the noise parameters related to the function value gap and the noise level are sufficiently small. The convergence rate thus matches the lower bound for stochastic first-order methods on non-convex smooth landscapes up to logarithmic terms [Arjevani et al., 2023]. We further derive a convergence bound for AdaGrad with momentum under generalized smoothness, where the local smoothness is controlled by a first-order function of the gradient norm.
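As a hedged illustration of the two assumptions named above (the abstract does not state them explicitly), the general noise model and the generalized smoothness condition are commonly formalized along the following lines; the constants $A$, $B$, $\sigma^2$, $L_0$, $L_1$ are illustrative placeholders, not quantities defined in this work:

```latex
% Sketch of the assumed conditions; constants are hypothetical placeholders.
% General noise model: the variance of the stochastic gradient g(x) is controlled
% by the function value gap f(x) - f^* and the gradient magnitude.
\begin{align}
  \mathbb{E}\big[\|g(x) - \nabla f(x)\|^2\big]
    &\le A\,\big(f(x) - f^*\big) + B\,\|\nabla f(x)\|^2 + \sigma^2, \\
% Generalized smoothness: the local smoothness is bounded by a first-order
% (affine) function of the gradient norm.
  \|\nabla^2 f(x)\|
    &\le L_0 + L_1\,\|\nabla f(x)\|.
\end{align}
```

Setting $A = B = 0$ recovers the standard bounded-variance setting, while $B > 0$ alone corresponds to affine variance noise, which is how this model subsumes the special cases listed in the abstract.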