We aim to make stochastic gradient descent (SGD) adaptive to (i) the noise $σ^2$ in the stochastic gradients and (ii) problem-dependent constants. When minimizing smooth, strongly-convex functions with condition number $κ$, we prove that $T$ iterations of SGD with exponentially decreasing step-sizes and knowledge of the smoothness can achieve an $\tilde{O} \left(\exp \left( \frac{-T}{κ} \right) + \frac{σ^2}{T} \right)$ rate, without knowing $σ^2$. In order to be adaptive to the smoothness, we use a stochastic line-search (SLS) and show (via upper and lower bounds) that SGD with SLS converges at the desired rate, but only to a neighbourhood of the solution. On the other hand, we prove that SGD with an offline estimate of the smoothness converges to the minimizer. However, its rate slows down in proportion to the estimation error. Next, we prove that SGD with Nesterov acceleration and exponential step-sizes (referred to as ASGD) can achieve the near-optimal $\tilde{O} \left(\exp \left( \frac{-T}{\sqrt{κ}} \right) + \frac{σ^2}{T} \right)$ rate, without knowledge of $σ^2$. When used with offline estimates of the smoothness and strong-convexity, ASGD still converges to the solution, albeit at a slower rate. We empirically demonstrate the effectiveness of exponential step-sizes coupled with a novel variant of SLS.
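To make the idea of exponentially decreasing step-sizes concrete, the following is a minimal sketch rather than the method as specified in the paper. The abstract does not state the schedule, so the form $η_k = \frac{1}{L} α^k$ with $α = (β/T)^{1/T}$ is an assumption here, as are the toy quadratic objective, the additive Gaussian noise model, and the function names `exponential_step_sizes` and `sgd_quadratic`. The schedule uses the smoothness constant $L$ and the horizon $T$, but not $σ^2$.

```python
import random


def exponential_step_sizes(T, L, beta=1.0):
    """Assumed schedule: eta_k = alpha**k / L for k = 1..T, alpha = (beta / T)**(1 / T).

    The step-size decays geometrically from roughly 1/L down to beta / (T * L).
    """
    alpha = (beta / T) ** (1.0 / T)
    return [alpha ** k / L for k in range(1, T + 1)]


def sgd_quadratic(T, sigma, seed=0):
    """Run SGD on the toy objective f(w) = 0.5 * sum_i a_i * w_i**2.

    Gradients are perturbed by independent Gaussian noise with standard
    deviation sigma (a hypothetical noise model chosen for illustration).
    Returns the final objective value f(w_T).
    """
    rng = random.Random(seed)
    a = [1.0, 10.0]   # Hessian eigenvalues: mu = 1, L = 10, so kappa = 10
    L = max(a)        # smoothness constant, assumed known
    w = [1.0, 1.0]    # starting point, f(w_0) = 5.5
    for eta in exponential_step_sizes(T, L):
        grad = [ai * wi + rng.gauss(0.0, sigma) for ai, wi in zip(a, w)]
        w = [wi - eta * gi for wi, gi in zip(w, grad)]
    return 0.5 * sum(ai * wi * wi for ai, wi in zip(a, w))


if __name__ == "__main__":
    for sigma in (0.0, 1.0):
        print(f"sigma={sigma}: final f = {sgd_quadratic(1000, sigma):.3e}")
```

Running the script with `sigma=0.0` and `sigma=1.0` uses the same schedule in both cases, which is the sense in which the step-size is oblivious to the noise level in this sketch.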