Motivated by the growing popularity and importance of large-scale training under differential privacy (DP) constraints, we study distributed gradient methods with gradient clipping, i.e., clipping applied to the gradients that each node computes from its local information. While gradient clipping is an essential tool for injecting formal DP guarantees into gradient-based methods [1], it also induces bias, which causes serious convergence issues specific to the distributed setting. Inspired by recent progress in the error-feedback literature, which focuses on taming the bias/error introduced by communication compression operators such as Top-$k$ [2], and by mathematical similarities between the clipping operator and contractive compression operators, we design Clip21, the first provably effective and practically useful error feedback mechanism for distributed methods with gradient clipping. We prove that our method converges at the same $\mathcal{O}\left(\frac{1}{K}\right)$ rate as distributed gradient descent in the smooth nonconvex regime, improving on the previous best $\mathcal{O}\left(\frac{1}{\sqrt{K}}\right)$ rate, which was obtained under significantly stronger assumptions. In practice, our method also converges significantly faster than competing methods.
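To make the bias issue concrete, the following is a minimal sketch on a hypothetical one-dimensional problem with three nodes. Everything in it is an illustrative assumption rather than something taken from the paper: the quadratic losses, the values in `A`, the threshold `tau`, the step size `gamma`, and the function names. The error-feedback variant uses an EF21-style shift update with clipping substituted for the compressor, which is what the abstract's analogy suggests; it should not be read as the exact Clip21 algorithm.

```python
def clip(v: float, tau: float) -> float:
    """Scalar case of norm clipping: min(1, tau / |v|) * v."""
    return max(-tau, min(tau, v))


# Hypothetical toy problem: node i holds f_i(x) = 0.5 * (x - a_i)^2,
# so grad f_i(x) = x - a_i and the average loss is minimized at mean(A) = 0.
A = [4.0, -2.0, -2.0]


def local_grads(x: float) -> list[float]:
    """Gradients computed from local information at each node."""
    return [x - a for a in A]


def clipped_gd(x: float, tau: float = 1.0, gamma: float = 0.5, steps: int = 200) -> float:
    """Distributed GD where every node clips its own gradient before averaging."""
    for _ in range(steps):
        clipped = [clip(g, tau) for g in local_grads(x)]
        x -= gamma * sum(clipped) / len(clipped)
    return x


def ef_clipped_gd(x: float, tau: float = 1.0, gamma: float = 0.5, steps: int = 200) -> float:
    """EF21-style sketch (assumed form): each node clips the difference between
    its fresh gradient and a locally stored shift, then adds it to the shift."""
    shifts = [clip(g, tau) for g in local_grads(x)]
    for _ in range(steps):
        x -= gamma * sum(shifts) / len(shifts)
        shifts = [s + clip(g - s, tau) for s, g in zip(shifts, local_grads(x))]
    return x


if __name__ == "__main__":
    print("clipped GD:        ", clipped_gd(0.0))
    print("error-feedback GD: ", ef_clipped_gd(0.0))
```

Started at the true minimizer x = 0, the plainly clipped iteration drifts away and settles near x = -1.5, because the clipped local gradients no longer average to zero there. In this toy setting the error-feedback variant returns to a point near 0, since each shift eventually tracks its node's gradient exactly once the clipped difference falls below `tau`.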