Motivated by the analysis of large-scale machine learning under heavy-tailed gradient noise, we study decentralized optimization with gradient clipping, in which clipping operators are applied to the gradients or gradient estimates computed at local nodes before further processing. While vanilla gradient clipping has proven effective in mitigating the impact of heavy-tailed gradient noise in non-distributed setups, it incurs a bias that causes convergence issues in heterogeneous distributed settings. To address this inherent bias, we develop a smoothed clipping operator and propose a decentralized gradient method equipped with an error feedback mechanism, i.e., the clipping operator is applied to the difference between a local gradient estimator and the local stochastic gradient. We consider strongly convex and smooth local functions under symmetric heavy-tailed gradient noise that may not have finite moments of order greater than one. We show that the proposed decentralized gradient clipping method achieves a mean-square error (MSE) convergence rate of $O(1/t^\delta)$, $\delta \in (0, 2/5)$, where the exponent $\delta$ is independent of the existence of gradient noise moments of order $\alpha > 1$ and is lower bounded by a constant that depends on the condition number. To the best of our knowledge, this is the first MSE convergence result for decentralized gradient clipping under heavy-tailed noise without assuming bounded gradients. Numerical experiments validate our theoretical findings.
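To make the error feedback mechanism concrete, the following is a minimal NumPy sketch of one possible instantiation. The particular smooth-clipping form, the estimator update, and all parameters (`eta`, `beta`, `lam`, the mixing matrix `W`) are illustrative assumptions, not the paper's exact algorithm: clipping is applied to the difference between the local gradient estimator and the local stochastic gradient, after which nodes mix iterates with their neighbors.

```python
import numpy as np

def smooth_clip(v, lam):
    """Smoothed clipping: rescale v continuously instead of hard truncation.
    (Assumed form v * lam / (lam + ||v||); the paper's operator may differ.)"""
    return v * lam / (lam + np.linalg.norm(v))

def decentralized_clipped_step(X, Z, W, grads, eta, beta, lam):
    """One iteration of a decentralized gradient method with error feedback.

    X:     (n, d) node iterates
    Z:     (n, d) local gradient estimators
    W:     (n, n) doubly stochastic mixing matrix
    grads: (n, d) local stochastic gradients
    """
    n = X.shape[0]
    for i in range(n):
        # Error feedback: clip only the estimation error, not the raw gradient,
        # so the clipping bias shrinks as Z[i] tracks the true local gradient.
        Z[i] = Z[i] + beta * smooth_clip(grads[i] - Z[i], lam)
    # Consensus step (mix with neighbors), then descend along the estimator.
    X_new = W @ X - eta * Z
    return X_new, Z

# Toy usage (assumed setup): n nodes minimize sum_i 0.5 * ||x - b_i||^2 under
# symmetric heavy-tailed (Cauchy) gradient noise; the global optimum is mean(b_i).
rng = np.random.default_rng(0)
n, d = 4, 3
b = rng.normal(size=(n, d))                   # local minimizers
W = np.full((n, n), 1.0 / n)                  # fully connected mixing (toy choice)
X, Z = np.zeros((n, d)), np.zeros((n, d))
for t in range(2000):
    noise = rng.standard_cauchy(size=(n, d))  # symmetric heavy-tailed noise
    grads = (X - b) + noise                   # noisy local gradients
    X, Z = decentralized_clipped_step(X, Z, W, grads, eta=0.02, beta=0.1, lam=1.0)
print(np.linalg.norm(X.mean(axis=0) - b.mean(axis=0)))  # distance to optimum
```

Note the design point this sketch illustrates: because the clipping acts on `grads[i] - Z[i]` rather than on `grads[i]` itself, the argument of the clipping operator tends toward pure noise as the estimator converges, which is what removes the heterogeneity-induced bias of vanilla clipping.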