Gradient compression has emerged as a key technique for improving communication efficiency in distributed learning. In distributed deep learning, however, gradient distributions are observed to be heavy-tailed, and the outliers strongly influence the design of compression strategies. Existing parameter quantization methods suffer performance degradation when this heavy-tailed property is ignored. In this paper, we introduce a novel compression scheme designed specifically for heavy-tailed gradients that combines gradient truncation with quantization, and we implement it within a communication-limited distributed Stochastic Gradient Descent (SGD) framework. We consider a general family of heavy-tailed gradients that follow a power-law distribution and minimize the error resulting from quantization, which determines the optimal values of two critical parameters: the truncation threshold and the quantization density. We provide a theoretical analysis of the convergence error bound under both uniform and non-uniform quantization. Comparative experiments against other benchmarks demonstrate the effectiveness of the proposed method in handling heavy-tailed gradients in a distributed learning environment.
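To make the idea of combining truncation with quantization concrete, the following is a minimal sketch of the uniform case, not the paper's actual scheme. The function names, the use of stochastic rounding, and the values chosen for `threshold` and `num_levels` are all assumptions made for illustration; the abstract does not state the optimal threshold or quantization density, so arbitrary values are used here.

```python
import numpy as np


def truncate_and_quantize(grad, threshold, num_levels, rng):
    """Clip grad to [-threshold, threshold], then stochastically round
    each entry onto a uniform grid of num_levels points.

    Hypothetical helper: threshold and num_levels stand in for the
    truncation threshold and quantization density named in the abstract.
    """
    clipped = np.clip(grad, -threshold, threshold)
    step = 2.0 * threshold / (num_levels - 1)
    # Position of each entry on the grid, in [0, num_levels - 1].
    scaled = (clipped + threshold) / step
    lower = np.floor(scaled)
    # Round up with probability equal to the fractional part, so the
    # quantizer is unbiased for entries inside the truncation range.
    prob_up = scaled - lower
    idx = lower + (rng.random(grad.shape) < prob_up)
    idx = np.clip(idx, 0, num_levels - 1).astype(np.int64)
    return idx, step


def dequantize(idx, threshold, step):
    """Map integer grid indices back to real values."""
    return idx * step - threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Student-t samples serve as a stand-in for heavy-tailed gradients.
    grad = rng.standard_t(df=3, size=10_000)
    idx, step = truncate_and_quantize(grad, threshold=3.0, num_levels=16, rng=rng)
    recon = dequantize(idx, 3.0, step)
    print("mean squared error:", np.mean((grad - recon) ** 2))
```

In this sketch only the integer indices would need to be communicated, at `log2(num_levels)` bits per entry. A larger `threshold` reduces the truncation error on outliers but widens the grid spacing, which is the trade-off that optimizing the two parameters addresses.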