This paper investigates the roles of gradient normalization and clipping in ensuring the convergence of Stochastic Gradient Descent (SGD) under heavy-tailed noise. While existing analyses treat gradient clipping as indispensable for SGD convergence, we theoretically demonstrate that gradient normalization alone, without clipping, is sufficient to guarantee convergence. Furthermore, we establish that combining gradient normalization with clipping yields significantly improved convergence rates compared to using either technique in isolation, particularly as the gradient noise diminishes. These results provide the first theoretical evidence of the benefits of gradient normalization for SGD under heavy-tailed noise. Finally, we introduce an accelerated SGD variant that incorporates both gradient normalization and clipping, further improving convergence rates under heavy-tailed noise.
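For concreteness, the two gradient modifications referred to above are sketched below in their standard forms; here $g_t$ denotes the stochastic gradient at iterate $x_t$, $\eta_t$ the step size, and $\tau_t$ the clipping threshold, and the precise combined update analyzed in this paper is the one defined in the main text.

\begin{align*}
  % Clipped SGD: rescale the stochastic gradient so its norm never exceeds \tau_t
  x_{t+1} &= x_t - \eta_t \min\!\left(1, \frac{\tau_t}{\|g_t\|}\right) g_t, \\
  % Normalized SGD: use only the direction of the stochastic gradient
  x_{t+1} &= x_t - \eta_t \, \frac{g_t}{\|g_t\|}.
\end{align*}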