The classical analysis of Stochastic Gradient Descent (SGD) with polynomially decaying stepsize $\eta_t = \eta/\sqrt{t}$ relies on a well-tuned $\eta$ that depends on problem parameters such as the Lipschitz smoothness constant, which is often unknown in practice. In this work, we prove that SGD with arbitrary $\eta > 0$, referred to as untuned SGD, still attains an order-optimal convergence rate $\widetilde{O}(T^{-1/4})$ in terms of gradient norm for minimizing smooth objectives. Unfortunately, this comes at the expense of a catastrophic exponential dependence on the smoothness constant, which we show is unavoidable for this scheme even in the noiseless setting. We then examine three families of adaptive methods, namely Normalized SGD (NSGD), AMSGrad, and AdaGrad, unveiling their power in preventing such exponential dependency in the absence of information about the smoothness parameter and boundedness of stochastic gradients. Our results provide theoretical justification for the advantage of adaptive methods over untuned SGD in alleviating the issue with large gradients.
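The contrast described in the abstract can be illustrated with a minimal sketch on a toy problem. Everything in it is an assumption made for illustration and not taken from the paper: the objective is the noiseless one-dimensional quadratic $f(x) = \tfrac{L}{2}x^2$, the values of `L`, `x0`, `eta`, and `T` are arbitrary, `normalized_gd` is a momentum-free simplification of NSGD, `adagrad_norm` is the scalar (norm) variant of AdaGrad, and AMSGrad is omitted. All helper names are hypothetical.

```python
import math

def run(step_rule, L=20.0, x0=0.7, eta=1.0, T=200):
    """Minimize f(x) = (L/2) x^2 without noise; return the largest |f'(x_t)| seen."""
    # `state` is an accumulator used only by adagrad_norm; the small positive
    # initial value (an assumption) avoids dividing by zero.
    x, state, peak = x0, 1e-8, 0.0
    for t in range(1, T + 1):
        g = L * x                       # exact gradient of the toy quadratic
        peak = max(peak, abs(g))
        dx, state = step_rule(g, t, eta, state)
        x -= dx
    return peak

def untuned_sgd(g, t, eta, state):
    # Stepsize eta / sqrt(t) with eta chosen without knowledge of L.
    return eta / math.sqrt(t) * g, state

def normalized_gd(g, t, eta, state):
    # Same stepsize schedule, but only the gradient direction is used.
    direction = g / abs(g) if g != 0.0 else 0.0
    return eta / math.sqrt(t) * direction, state

def adagrad_norm(g, t, eta, state):
    # Stepsize eta / sqrt(sum of squared gradients seen so far).
    state += g * g
    return eta / math.sqrt(state) * g, state

if __name__ == "__main__":
    for rule in (untuned_sgd, normalized_gd, adagrad_norm):
        print(f"{rule.__name__:>14}: peak |grad| = {run(rule):.3e}")
```

On this instance the untuned run starts with $\eta_t L > 2$, so the gradient grows by many orders of magnitude until the stepsize decays below $2/L$, whereas the two adaptive sketches never move further than $\eta$ in a single step and keep the gradient near its initial size. This is a toy demonstration of the phenomenon, not a reproduction of the paper's bounds.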