Breaking the Lower Bound with (Little) Structure: Acceleration in Non-Convex Stochastic Optimization with Heavy-Tailed Noise

We consider stochastic optimization with smooth but not necessarily convex objectives in the heavy-tailed noise regime, where the noise of the stochastic gradient is assumed to have a bounded $p$th moment ($p\in(1,2]$). Zhang et al. (2020) were the first to prove the $\Omega(T^{\frac{1-p}{3p-2}})$ lower bound for convergence (in expectation), and they provide a simple clipping algorithm that matches this optimal rate. Cutkosky and Mehta (2021) propose another algorithm, which is shown to achieve the nearly optimal high-probability convergence guarantee $O(\log(T/\delta)T^{\frac{1-p}{3p-2}})$, where $\delta$ is the probability of failure. However, this desirable guarantee is established only under the additional assumption that the stochastic gradient itself is bounded in $p$th moment, which fails to hold even for quadratic objectives with centered Gaussian noise. In this work, we first improve the analysis of the algorithm in Cutkosky and Mehta (2021) to obtain the same nearly optimal high-probability convergence rate $O(\log(T/\delta)T^{\frac{1-p}{3p-2}})$ without the above-mentioned restrictive assumption. Next, and curiously, we show that one can achieve a faster rate than that dictated by the lower bound $\Omega(T^{\frac{1-p}{3p-2}})$ with only a tiny bit of structure, i.e., when the objective function $F(x)$ is assumed to be of the form $\mathbb{E}_{\Xi\sim\mathcal{D}}[f(x,\Xi)]$, arguably the most widely applicable class of stochastic optimization problems. For this class of problems, we propose the first variance-reduced accelerated algorithm and establish that it guarantees a high-probability convergence rate of $O(\log(T/\delta)T^{\frac{1-p}{2p-1}})$ under a mild condition, which is faster than $\Omega(T^{\frac{1-p}{3p-2}})$. Notably, even when specialized to the finite-variance case, our result yields the (near-)optimal high-probability rate $O(\log(T/\delta)T^{-1/3})$.
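
As a quick sanity check on the exponent comparison stated above (a worked calculation from the stated rates only, not taken from the paper's proofs), note that for $p\in(1,2]$ both denominators are positive and $(3p-2)-(2p-1)=p-1>0$. Hence, for $T>1$,

$$
\frac{p-1}{2p-1} \;>\; \frac{p-1}{3p-2}
\quad\Longrightarrow\quad
T^{\frac{1-p}{2p-1}} \;<\; T^{\frac{1-p}{3p-2}} .
$$

At $p=2$ the two exponents are $\frac{1-p}{2p-1}=-\frac{1}{3}$ and $\frac{1-p}{3p-2}=-\frac{1}{4}$, which is consistent with the $T^{-1/3}$ rate quoted for the finite-variance case. As $p\to 1$ both exponents tend to $0$, so the gap between the two rates vanishes in that limit.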
