Recently, the study of heavy-tailed noise in first-order nonconvex stochastic optimization has attracted much attention, since many empirical observations suggest it is a more realistic condition. Specifically, the stochastic noise (the difference between the stochastic gradient and the true gradient) is assumed only to have a finite $\mathfrak{p}$-th moment, where $\mathfrak{p}\in\left(1,2\right]$, rather than to satisfy the classical finite-variance assumption. To handle this more challenging setting, several algorithms have been proposed and shown to converge at the optimal $\mathcal{O}(T^{\frac{1-\mathfrak{p}}{3\mathfrak{p}-2}})$ rate for smooth objectives after $T$ iterations. Notably, all of these newly designed algorithms rely on the same technique: gradient clipping. It is therefore natural to ask whether clipping is a necessary ingredient, and the only way, to guarantee convergence under heavy-tailed noise. In this work, by revisiting the existing Batched Normalized Stochastic Gradient Descent with Momentum (Batched NSGDM) algorithm, we provide the first convergence result under heavy-tailed noise without gradient clipping. Concretely, we prove that Batched NSGDM achieves the optimal $\mathcal{O}(T^{\frac{1-\mathfrak{p}}{3\mathfrak{p}-2}})$ rate even under the relaxed smoothness condition. More interestingly, we also establish the first $\mathcal{O}(T^{\frac{1-\mathfrak{p}}{2\mathfrak{p}}})$ convergence rate for the case where the tail index $\mathfrak{p}$ is unknown in advance, which is arguably the common scenario in practice.
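For readers unfamiliar with the update rule, a minimal sketch of a Batched NSGDM step (batch-averaged stochastic gradient, exponential momentum, then a step along the normalized momentum direction, with no clipping anywhere) is given below. The toy quadratic objective, Student-t noise model, and all hyperparameter values are illustrative assumptions for this sketch, not the paper's actual experimental setup.

```python
import numpy as np

def batched_nsgdm(grad_oracle, x0, T, batch_size, lr, beta):
    """Illustrative sketch of Batched NSGDM: average a mini-batch of
    stochastic gradients, update an exponential momentum buffer, then
    take a step of fixed length lr along the normalized momentum.
    Note that no gradient clipping is used anywhere."""
    x = x0.astype(float).copy()
    m = np.zeros_like(x)
    for _ in range(T):
        # batch-averaged stochastic gradient
        g = np.mean([grad_oracle(x) for _ in range(batch_size)], axis=0)
        m = beta * m + (1.0 - beta) * g        # momentum update
        norm = np.linalg.norm(m)
        if norm > 0.0:
            x -= lr * m / norm                 # normalized step
    return x

# Toy demo: minimize f(x) = ||x||^2 / 2, whose true gradient is x,
# with heavy-tailed Student-t gradient noise (df = 1.5: the variance is
# infinite, but the p-th moment is finite for any p < 1.5).
rng = np.random.default_rng(0)
oracle = lambda x: x + rng.standard_t(1.5, size=x.shape)
x_final = batched_nsgdm(oracle, np.array([5.0, -3.0]), T=500,
                        batch_size=8, lr=0.05, beta=0.9)
```

Because the step length is always exactly `lr`, individual heavy-tailed gradient samples cannot blow up any single update, which is the intuition for why normalization can replace clipping here.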