Motivated by the need to understand and analyze large-scale machine learning under heavy-tailed gradient noise, we study distributed optimization with smoothed gradient clipping, in which smoothed clipping operators are applied to the gradients or gradient estimates computed by local clients before further processing. While vanilla gradient clipping has proven effective at mitigating the impact of heavy-tailed gradient noise in non-distributed setups, it introduces bias that causes convergence issues in heterogeneous distributed settings. To address this inherent bias, we develop a smoothed clipping operator and propose a distributed gradient method equipped with an error feedback mechanism: the clipping operator is applied to the difference between a local gradient estimator and the local stochastic gradient. We establish, for the first time in the strongly convex setting with heavy-tailed gradient noise that may not have finite moments of order greater than one, that the mean squared error (MSE) of the proposed method converges to zero at a rate $O(1/t^\iota)$, $\iota \in (0, 0.4)$, where the exponent $\iota$ stays bounded away from zero as a function of the problem condition number and the first absolute moment of the noise. Numerical experiments validate our theoretical findings.
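To make the error feedback idea concrete, the following is a minimal sketch, not the method as specified in the paper. The abstract gives neither the exact form of the smoothed clipping operator nor the update rules, so everything here is an assumption made for illustration: the operator `smooth_clip` (a norm-based smooth saturation with hypothetical parameters `c` and `tau`), the constant step sizes `lr` and `beta`, the server-style averaging of local estimators, the quadratic local losses, and the Student-t noise with 1.5 degrees of freedom (finite mean, infinite variance). The only element taken from the abstract is that clipping acts on the difference between the local stochastic gradient and a local gradient estimator.

```python
import numpy as np


def smooth_clip(y, c=1.0, tau=1.0):
    """Illustrative smooth clipping (an assumed form, not the paper's operator).

    The output is a smooth function of y and its norm is always below c.
    """
    y = np.asarray(y, dtype=float)
    return c * y / np.sqrt(np.dot(y, y) + tau ** 2)


def run_sketch(targets, steps=200, lr=0.1, beta=0.5, seed=0):
    """Toy distributed loop with clipped error feedback.

    Client i holds the assumed local loss 0.5 * ||x - targets[i]||^2,
    so heterogeneity comes from the differing targets.
    """
    rng = np.random.default_rng(seed)
    n_clients, dim = targets.shape
    x = np.zeros(dim)                 # shared model
    m = np.zeros((n_clients, dim))    # local gradient estimators
    for _ in range(steps):
        for i in range(n_clients):
            # Heavy-tailed noise: Student-t with df=1.5 has no finite variance.
            noise = rng.standard_t(df=1.5, size=dim)
            g = (x - targets[i]) + noise          # local stochastic gradient
            # Clip the difference, not the gradient itself.
            m[i] += beta * smooth_clip(g - m[i])
        # Step along the average of the local estimators.
        x = x - lr * m.mean(axis=0)
    return x
```

Clipping the difference rather than the raw gradient means that the quantity being clipped shrinks as each estimator tracks its local gradient, which is the intuition behind using error feedback to counter clipping bias under heterogeneity. The sketch makes no claim about reproducing the stated $O(1/t^\iota)$ rate.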