In this study, we revisit the convergence of AdaGrad with momentum (which covers AdaGrad as a special case) on non-convex smooth optimization problems. We consider a general noise model in which the noise magnitude is controlled by the function value gap together with the gradient magnitude. This model encompasses a broad range of noise assumptions, including bounded noise, sub-Gaussian noise, affine variance noise, and expected smoothness, and it has been shown to be more realistic in many practical applications. Our analysis yields a probabilistic convergence rate that, under the general noise model, reaches \(\tilde{\mathcal{O}}(1/\sqrt{T})\), where \(T\) denotes the total number of iterations. This rate does not rely on prior knowledge of problem parameters and accelerates to \(\tilde{\mathcal{O}}(1/T)\) when the noise parameters related to the function value gap and the noise level are sufficiently small. The convergence rate thus matches the lower bound for stochastic first-order methods over non-convex smooth landscapes up to logarithmic factors [Arjevani et al., 2023]. We further derive a convergence bound for AdaGrad with momentum under generalized smoothness, where the local smoothness is controlled by a first-order function of the gradient norm.
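To make the setting concrete, the following is a minimal sketch of one common form of AdaGrad with momentum (a coordinate-wise AdaGrad step plus a heavy-ball term), run against a toy stochastic gradient whose noise is bounded by the function value gap and the gradient norm. The abstract does not state the update rule or the exact noise condition, so the heavy-ball form, the quadratic test function, and every name and constant here (`lr`, `beta`, `eps`, `a`, `b`, `c`) are illustrative assumptions rather than the paper's definitions.

```python
import math
import random


def adagrad_momentum(grad_fn, x0, steps, lr=0.5, beta=0.9, eps=1e-8):
    """Coordinate-wise AdaGrad with a heavy-ball momentum term.

    Assumed update (not taken from the abstract):
        v_t     = v_{t-1} + g_t^2
        x_{t+1} = x_t - lr / (sqrt(v_t) + eps) * g_t + beta * (x_t - x_{t-1})
    Setting beta = 0 recovers plain AdaGrad.
    """
    x = list(x0)
    x_prev = list(x0)
    v = [0.0] * len(x)  # running sum of squared stochastic gradients
    for _ in range(steps):
        g = grad_fn(x)
        x_next = []
        for i in range(len(x)):
            v[i] += g[i] * g[i]
            step = lr / (math.sqrt(v[i]) + eps)
            x_next.append(x[i] - step * g[i] + beta * (x[i] - x_prev[i]))
        x_prev, x = x, x_next
    return x


def make_noisy_grad(rng, a=0.1, b=0.1, c=0.01):
    """Stochastic gradient of the toy function f(x) = 0.5 * ||x||^2 (f* = 0).

    The noise is scaled so that, for every sample,
        ||g - grad f(x)||^2 <= a * (f(x) - f*) + b * ||grad f(x)||^2 + c.
    The constants a, b, c are hypothetical and chosen only for illustration.
    """
    def grad_fn(x):
        gap = 0.5 * sum(xi * xi for xi in x)   # f(x) - f*
        grad_sq = sum(xi * xi for xi in x)     # ||grad f(x)||^2, since grad f(x) = x
        sigma = math.sqrt(a * gap + b * grad_sq + c)
        scale = sigma / math.sqrt(len(x))      # keeps the noise norm at most sigma
        return [xi + scale * rng.uniform(-1.0, 1.0) for xi in x]
    return grad_fn


if __name__ == "__main__":
    rng = random.Random(0)
    x_final = adagrad_momentum(make_noisy_grad(rng), [1.0, -2.0], steps=500)
    print("final iterate:", x_final)
```

Note that no step size is tuned to a smoothness constant or noise level here; the per-coordinate accumulator `v` sets the effective step on its own, which is the sense in which such a method needs no prior knowledge of problem parameters.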