We study Stochastic Gradient Descent with AdaGrad stepsizes, a popular adaptive (self-tuning) method for first-order stochastic optimization. Although the method is well studied, existing analyses suffer from various shortcomings: they either assume some knowledge of the problem parameters, impose strong global Lipschitz conditions, or fail to give bounds that hold with high probability. We provide a comprehensive analysis of this basic method without any of these limitations, in both the convex and non-convex (smooth) cases, that additionally supports a general "affine variance" noise model and provides sharp rates of convergence in both the low-noise and high-noise regimes.
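For readers unfamiliar with the method, the following is a minimal sketch of SGD with AdaGrad stepsizes, assuming the scalar ("norm") variant in which a single stepsize eta / sqrt(gamma^2 + sum of squared gradient norms) is shared by all coordinates. The function names, the test objective f(x) = 0.5 * ||x||^2, the choice to include the current gradient in the running sum, and the noise parameters `sigma0` and `sigma1` are illustrative assumptions and are not taken from the abstract. The noise is drawn so that its expected squared norm equals sigma0^2 + sigma1^2 * ||grad f(x)||^2, one common way to formalize an affine variance condition.

```python
import math
import random


def sgd_adagrad_norm(grad_oracle, x0, eta=1.0, gamma=1e-8, steps=1000):
    """Run SGD with scalar AdaGrad stepsizes (illustrative sketch).

    Stepsize at round t: eta / sqrt(gamma^2 + sum_{s<=t} ||g_s||^2).
    No smoothness or noise parameters are needed to set it.
    """
    x = list(x0)
    sum_sq = gamma ** 2  # running sum of squared stochastic-gradient norms
    for _ in range(steps):
        g = grad_oracle(x)
        sum_sq += sum(gi * gi for gi in g)
        step = eta / math.sqrt(sum_sq)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def noisy_quadratic_grad(x, rng, sigma0=0.1, sigma1=0.1):
    """Stochastic gradient of f(x) = 0.5 * ||x||^2 (hypothetical test problem).

    Gaussian noise is scaled so that
    E||noise||^2 = sigma0^2 + sigma1^2 * ||grad f(x)||^2.
    """
    grad = list(x)  # exact gradient of 0.5 * ||x||^2 is x itself
    grad_norm_sq = sum(gi * gi for gi in grad)
    per_coord_std = math.sqrt((sigma0 ** 2 + sigma1 ** 2 * grad_norm_sq) / len(x))
    return [gi + rng.gauss(0.0, per_coord_std) for gi in grad]


if __name__ == "__main__":
    rng = random.Random(0)
    x_final = sgd_adagrad_norm(
        lambda x: noisy_quadratic_grad(x, rng), [5.0, -3.0], eta=1.0, steps=2000
    )
    print("final distance to optimum:", math.sqrt(sum(v * v for v in x_final)))
```

Setting `sigma1 = 0` recovers the usual bounded-variance noise model, and setting both noise parameters to zero gives the deterministic (noiseless) case; the stepsize rule itself is unchanged in all three settings.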