Recently, the study of heavy-tailed noise in first-order nonconvex stochastic optimization has attracted considerable attention, as many empirical observations suggest that it is a more realistic condition. Specifically, the stochastic noise (the difference between the stochastic gradient and the true gradient) is assumed to have only a finite $\mathfrak{p}$-th moment with $\mathfrak{p}\in\left(1,2\right]$, rather than the classical finite-variance assumption. To handle this more challenging setting, various algorithms have been proposed and shown to converge at the optimal $\mathcal{O}(T^{\frac{1-\mathfrak{p}}{3\mathfrak{p}-2}})$ rate for smooth objectives after $T$ iterations. Notably, all of these newly designed algorithms rely on the same technique, gradient clipping. It is therefore natural to ask whether clipping is a necessary ingredient, or indeed the only way, to guarantee convergence under heavy-tailed noise. In this work, by revisiting the existing Batched Normalized Stochastic Gradient Descent with Momentum (Batched NSGDM) algorithm, we provide the first convergence result under heavy-tailed noise without gradient clipping. Concretely, we prove that Batched NSGDM achieves the optimal $\mathcal{O}(T^{\frac{1-\mathfrak{p}}{3\mathfrak{p}-2}})$ rate even under the relaxed smoothness condition. More interestingly, we also establish the first $\mathcal{O}(T^{\frac{1-\mathfrak{p}}{2\mathfrak{p}}})$ convergence rate for the case where the tail index $\mathfrak{p}$ is unknown in advance, which is arguably the more common scenario in practice.
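For concreteness, a minimal sketch of the Batched NSGDM update referenced above, written in the standard normalized-momentum form (the exact parameterization and hyperparameter choices used in the analysis may differ):
\[
\bar{g}_t = \frac{1}{B}\sum_{i=1}^{B} \nabla f(x_t;\xi_t^{(i)}), \qquad
m_t = \beta\, m_{t-1} + (1-\beta)\,\bar{g}_t, \qquad
x_{t+1} = x_t - \eta\,\frac{m_t}{\lVert m_t\rVert},
\]
where $B$ is the batch size, $\beta\in[0,1)$ the momentum parameter, and $\eta>0$ the step size. No clipping operator appears anywhere in the update; the normalization by $\lVert m_t\rVert$ is what controls the effect of heavy-tailed gradient noise.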