Distributed stochastic gradient descent (SGD) with gradient compression has become a popular communication-efficient approach to accelerating distributed learning. A commonly used compression method is Top-K sparsification, which sparsifies the gradients by a fixed degree throughout model training. However, an adaptive approach that adjusts the sparsification degree to get the most out of model performance or training speed has been lacking. This paper proposes a novel adaptive Top-K SGD framework that selects the degree of sparsification at each gradient descent step, optimizing convergence performance by balancing the trade-off between communication cost and convergence error. First, an upper bound on the convergence error is derived for the adaptive sparsification scheme and the loss function. Second, an algorithm is designed to minimize the convergence error under communication cost constraints. Finally, numerical results on the MNIST and CIFAR-10 datasets demonstrate that the proposed adaptive Top-K SGD algorithm achieves a significantly better convergence rate than state-of-the-art methods, even when error compensation is taken into account.
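To make the terminology concrete, the following minimal Python sketch illustrates Top-K sparsification with error compensation on a single worker. It is not the paper's algorithm: the function names `top_k_sparsify` and `compress_with_error_feedback`, the toy gradient, and the per-step values of `k` are all assumptions chosen for illustration, and the rule by which an adaptive scheme would pick `k` is deliberately left out.

```python
from typing import List, Tuple


def top_k_sparsify(grad: List[float], k: int) -> List[float]:
    """Keep the k largest-magnitude entries of grad and zero out the rest."""
    if k <= 0:
        return [0.0] * len(grad)
    if k >= len(grad):
        return list(grad)
    # Indices of the k entries with the largest absolute value.
    keep = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    sparse = [0.0] * len(grad)
    for i in keep:
        sparse[i] = grad[i]
    return sparse


def compress_with_error_feedback(
    grad: List[float], residual: List[float], k: int
) -> Tuple[List[float], List[float]]:
    """Sparsify grad + residual and return (message to send, new residual)."""
    # Error compensation: add back what earlier steps dropped.
    corrected = [g + r for g, r in zip(grad, residual)]
    message = top_k_sparsify(corrected, k)
    # Whatever was not transmitted is carried over to the next step.
    new_residual = [c - m for c, m in zip(corrected, message)]
    return message, new_residual


# Toy gradient, reused at every step only to keep the example short.
grad = [0.5, -2.0, 0.1, 1.5]
residual = [0.0] * len(grad)

# Hypothetical per-step sparsification degrees; an adaptive scheme would
# choose these values, here they are fixed by hand.
for step, k in enumerate([1, 2]):
    message, residual = compress_with_error_feedback(grad, residual, k)
    print(step, k, message)
```

A smaller `k` reduces the number of entries sent per step at the price of a larger residual, which is the communication cost versus convergence error trade-off the abstract refers to.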