We consider stochastic optimization with smooth but not necessarily convex objectives in the heavy-tailed noise regime, where the noise of the stochastic gradient is assumed to have a bounded $p$th moment ($p\in(1,2]$). Zhang et al. (2020) were the first to prove the $\Omega(T^{\frac{1-p}{3p-2}})$ lower bound for convergence (in expectation), and they provided a simple clipping algorithm that matches this optimal rate. Cutkosky and Mehta (2021) propose another algorithm, which is shown to achieve the nearly optimal high-probability convergence guarantee $O(\log(T/\delta)T^{\frac{1-p}{3p-2}})$, where $\delta$ is the probability of failure. However, this desirable guarantee is established only under the additional assumption that the stochastic gradient itself is bounded in $p$th moment, which fails to hold even for quadratic objectives and centered Gaussian noise. In this work, we first improve the analysis of the algorithm in Cutkosky and Mehta (2021) to obtain the same nearly optimal high-probability convergence rate $O(\log(T/\delta)T^{\frac{1-p}{3p-2}})$ without this restrictive assumption. Next, and curiously, we show that one can achieve a faster rate than that dictated by the lower bound $\Omega(T^{\frac{1-p}{3p-2}})$ with only a small amount of additional structure, namely when the objective function $F(x)$ is assumed to have the form $\mathbb{E}_{\Xi\sim\mathcal{D}}[f(x,\Xi)]$, arguably the most widely applicable class of stochastic optimization problems. For this class of problems, we propose the first variance-reduced accelerated algorithm and establish that it guarantees a high-probability convergence rate of $O(\log(T/\delta)T^{\frac{1-p}{2p-1}})$ under a mild condition, which is faster than $\Omega(T^{\frac{1-p}{3p-2}})$. Notably, even when specialized to the finite-variance case ($p=2$), our result yields the (near-)optimal high-probability rate $O(\log(T/\delta)T^{-1/3})$.
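To see why the second rate is faster, the two exponents can be compared directly. The short calculation below is only a sanity check on the rates quoted above, not part of the paper's analysis; it uses nothing beyond $p\in(1,2]$.

$$
(3p-2)-(2p-1)=p-1>0
\;\Longrightarrow\;
3p-2>2p-1>0
\;\Longrightarrow\;
\frac{p-1}{2p-1}>\frac{p-1}{3p-2}
\;\Longrightarrow\;
T^{\frac{1-p}{2p-1}}<T^{\frac{1-p}{3p-2}}\quad\text{for } T>1 .
$$

$$
p=2:\quad \frac{1-p}{3p-2}=-\frac14,\qquad \frac{1-p}{2p-1}=-\frac13 ;
\qquad\qquad
p=\tfrac32:\quad \frac{1-p}{3p-2}=-\frac15,\qquad \frac{1-p}{2p-1}=-\frac14 .
$$

At $p=2$ the improved exponent is exactly the $T^{-1/3}$ rate stated for the finite-variance case, and as $p\to 1$ both exponents tend to $0$, so the gap between the two rates closes.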