In this work, we describe a generic approach to showing convergence with high probability for both stochastic convex and non-convex optimization with sub-Gaussian noise. In previous works on convex optimization, either the convergence holds only in expectation or the bound depends on the diameter of the domain. Instead, we show high probability convergence with bounds that depend on the initial distance to the optimal solution. The algorithms use step sizes analogous to the standard settings and are universal to Lipschitz functions, smooth functions, and their linear combinations. This method can also be applied to the non-convex case. We demonstrate an $O((1+\sigma^{2}\log(1/\delta))/T+\sigma/\sqrt{T})$ convergence rate when the number of iterations $T$ is known and an $O((1+\sigma^{2}\log(T/\delta))/\sqrt{T})$ convergence rate when $T$ is unknown for SGD, where $1-\delta$ is the desired success probability. These bounds improve over existing bounds in the literature. Additionally, we demonstrate that our techniques can be used to obtain a high probability bound for AdaGrad-Norm (Ward et al., 2019) that removes the bounded-gradient assumption made in previous works. Furthermore, our technique for AdaGrad-Norm extends to the standard per-coordinate AdaGrad algorithm (Duchi et al., 2011), providing the first noise-adapted high probability convergence guarantee for AdaGrad.
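For readers unfamiliar with AdaGrad-Norm, the following is a minimal illustrative sketch of its commonly stated update rule, in which a single scalar accumulates squared stochastic gradient norms and scales the step size. It is not taken from this work: the function names `adagrad_norm` and `make_noisy_quadratic_oracle`, the parameters `eta` and `b0`, and the toy objective $f(x)=\frac{1}{2}\|x\|^{2}$ with additive Gaussian (hence sub-Gaussian) gradient noise are all assumptions chosen only for illustration.

```python
import numpy as np


def adagrad_norm(grad_oracle, x0, eta=1.0, b0=1e-8, num_steps=1000):
    """Sketch of AdaGrad-Norm: one scalar step size shared by all coordinates.

    grad_oracle(x) returns a (possibly noisy) gradient estimate at x.
    eta and b0 are illustrative hyperparameters, not values from the paper.
    """
    x = np.asarray(x0, dtype=float).copy()
    b_sq = b0 ** 2  # running sum of squared gradient norms (plus b0^2)
    for _ in range(num_steps):
        g = grad_oracle(x)
        b_sq += float(np.dot(g, g))          # accumulate ||g_t||^2
        x = x - (eta / np.sqrt(b_sq)) * g    # step scaled by 1 / b_t
    return x


def make_noisy_quadratic_oracle(sigma, seed=0):
    """Hypothetical oracle for f(x) = 0.5 * ||x||^2 with Gaussian noise."""
    rng = np.random.default_rng(seed)

    def oracle(x):
        # True gradient is x; the additive Gaussian noise is sub-Gaussian.
        return x + sigma * rng.standard_normal(x.shape)

    return oracle


if __name__ == "__main__":
    x0 = np.array([3.0, 4.0])
    x_final = adagrad_norm(make_noisy_quadratic_oracle(sigma=0.1), x0,
                           num_steps=2000)
    print("initial distance to optimum:", np.linalg.norm(x0))
    print("final distance to optimum:  ", np.linalg.norm(x_final))
```

Note that the step size here requires no knowledge of the noise level or of a bound on the gradients; the sketch only illustrates the update itself and makes no claim about the rates stated above.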