We study Stochastic Gradient Descent with AdaGrad stepsizes, a popular adaptive (self-tuning) method for first-order stochastic optimization. Although the method is well studied, existing analyses suffer from various shortcomings: they either assume some knowledge of the problem parameters, impose strong global Lipschitz conditions, or fail to give bounds that hold with high probability. We provide a comprehensive analysis of this basic method without any of these limitations, in both the convex and non-convex (smooth) cases. The analysis additionally supports a general "affine variance" noise model and provides sharp rates of convergence in both the low-noise and high-noise regimes.
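For concreteness, the method can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact setup analyzed here: it assumes the scalar ("norm") form of the AdaGrad stepsize, eta / sqrt(gamma^2 + sum of squared stochastic gradient norms), and it models "affine variance" as noise whose squared norm equals sigma0^2 + sigma1^2 * ||grad f(x)||^2. The names `sgd_adagrad_norm`, `make_affine_noise_oracle`, `eta`, `gamma`, `sigma0`, and `sigma1`, as well as the quadratic test objective, are hypothetical choices made for this example.

```python
import math
import random


def sgd_adagrad_norm(grad_oracle, x0, eta=1.0, gamma=1e-8, steps=1000):
    """SGD with scalar AdaGrad stepsizes (assumed 'norm' variant).

    The stepsize needs no knowledge of smoothness or noise parameters;
    eta and gamma are illustrative defaults, not values from the source.
    """
    x = list(x0)
    sum_sq = gamma ** 2  # gamma^2 plus running sum of squared gradient norms
    for _ in range(steps):
        g = grad_oracle(x)
        sum_sq += sum(gi * gi for gi in g)
        step = eta / math.sqrt(sum_sq)  # stepsize shrinks as gradients accumulate
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def make_affine_noise_oracle(sigma0, sigma1, seed=0):
    """Stochastic gradient oracle for the toy objective f(x) = 0.5 * ||x||^2.

    The noise is zero-mean with squared norm exactly
    sigma0^2 + sigma1^2 * ||grad f(x)||^2 (an assumed affine-variance model).
    """
    rng = random.Random(seed)

    def oracle(x):
        grad = list(x)  # exact gradient of 0.5 * ||x||^2
        grad_sq = sum(gi * gi for gi in grad)
        scale = math.sqrt(sigma0 ** 2 + sigma1 ** 2 * grad_sq)
        z = [rng.gauss(0.0, 1.0) for _ in grad]
        z_norm = math.sqrt(sum(zi * zi for zi in z))
        # Uniformly random direction, scaled to the affine noise level.
        return [gi + scale * zi / z_norm for gi, zi in zip(grad, z)]

    return oracle


if __name__ == "__main__":
    noisy = make_affine_noise_oracle(sigma0=0.1, sigma1=0.5, seed=1)
    print(sgd_adagrad_norm(noisy, [3.0, -4.0], steps=200))
```

With `sigma0 = sigma1 = 0` the oracle is exact, which corresponds to the noiseless end of the low-noise regime. Increasing `sigma1` makes the noise grow with the gradient norm, which a bounded-variance model does not capture.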