Motivated by the analysis of large-scale machine learning under heavy-tailed gradient noise, we study distributed optimization with gradient clipping, in which a clipping operator is applied to the gradients or gradient estimates computed by local clients before further processing. While vanilla gradient clipping has proven effective at mitigating heavy-tailed gradient noise in non-distributed setups, it introduces a bias that causes convergence issues in heterogeneous distributed settings. To address this inherent bias, we develop a smoothed clipping operator and propose a distributed gradient method equipped with an error feedback mechanism: the clipping operator is applied to the difference between a local gradient estimator and the local stochastic gradient. We establish, for the first time in the strongly convex setting with heavy-tailed gradient noise that may not have finite moments of order greater than one, that the mean squared error (MSE) of the proposed method converges to zero at a rate $O(1/t^\iota)$, $\iota \in (0, 1/2)$. The exponent $\iota$ stays bounded away from zero as a function of the problem condition number and the first absolute moment of the noise; in particular, it does not depend on the existence of gradient noise moments of order $\alpha > 1$. Numerical experiments validate our theoretical findings.
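The error feedback idea described above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration, not the construction analyzed in the paper: the form of `smooth_clip`, the constant step sizes `eta` and `beta`, the threshold `tau`, the sign convention of the clipped difference, the quadratic local losses, and the Student-t noise with 1.5 degrees of freedom (finite mean, infinite variance). Each client keeps a local gradient estimator `m[i]` and updates it with the smoothly clipped difference between its stochastic gradient and that estimator; the server then steps along the averaged estimators.

```python
import numpy as np


def smooth_clip(v, tau):
    """Assumed smooth clipping map (hypothetical form, chosen for illustration).

    Close to the identity when ||v|| << tau; its output norm is always < tau.
    """
    return tau * v / np.sqrt(tau**2 + np.dot(v, v))


def run(n_clients=5, dim=3, steps=2000, eta=0.05, beta=0.1, tau=1.0,
        noise_scale=1.0, seed=0):
    """Distributed gradient sketch with clipped error feedback.

    Client i holds f_i(x) = (a_i / 2) * ||x - b_i||^2, so the clients are
    heterogeneous: each has its own curvature a_i and local minimizer b_i.
    """
    rng = np.random.default_rng(seed)
    a = rng.uniform(1.0, 2.0, size=n_clients)    # local curvatures
    b = rng.normal(size=(n_clients, dim))        # local minimizers
    x_star = (a[:, None] * b).sum(axis=0) / a.sum()  # minimizer of the average loss

    x = np.zeros(dim)
    m = np.zeros((n_clients, dim))               # local gradient estimators
    for _ in range(steps):
        for i in range(n_clients):
            # Heavy-tailed noise: Student-t with df=1.5 has infinite variance.
            noise = noise_scale * rng.standard_t(df=1.5, size=dim)
            g = a[i] * (x - b[i]) + noise        # local stochastic gradient
            # Error feedback: clip the difference, not the gradient itself.
            m[i] += beta * smooth_clip(g - m[i], tau)
        x = x - eta * m.mean(axis=0)             # server step on averaged estimators
    return x, x_star


if __name__ == "__main__":
    x, x_star = run()
    print("distance to minimizer:", np.linalg.norm(x - x_star))
```

In this sketch the clipped quantity vanishes only when each estimator matches its local gradient, so without noise the fixed point is the minimizer of the average loss. Clipping each local gradient directly would generally shift the fixed point when the clients are heterogeneous.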