Gradient sparsification is a communication optimisation technique for scaling and accelerating distributed deep neural network (DNN) training. It reduces the growing communication traffic incurred by gradient aggregation. However, existing sparsifiers scale poorly because gradient selection is computationally expensive, because communication traffic increases, or both. In particular, the increase in communication traffic is caused by gradient build-up and by an inappropriate threshold for gradient selection. To address these challenges, we propose a novel gradient sparsification method called MiCRO. In MiCRO, the gradient vector is partitioned, and each partition is assigned to a corresponding worker. Each worker then selects gradients only from its own partition, so the selected indices never overlap across workers and the aggregated gradients are free from gradient build-up. Moreover, MiCRO estimates an accurate threshold by minimising the compression ratio error, which keeps the communication traffic at the level the user requires. By solving these problems, which hinder the scalability and acceleration of distributed DNN training, MiCRO enables near-zero cost gradient sparsification. In our extensive experiments, MiCRO outperformed state-of-the-art sparsifiers with an outstanding convergence rate.
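The two mechanisms described in the abstract, partition-wise selection and threshold estimation, can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' implementation: the contiguous static partitioning, the synthetic Gaussian gradients, the multiplicative update in `adjust_threshold` and its `gain` parameter, and all function names are hypothetical choices made for illustration. The abstract says only that the threshold is estimated by minimising the compression ratio error and does not specify the update rule.

```python
import numpy as np


def partition_bounds(n_grad, n_workers):
    """Split indices [0, n_grad) into n_workers contiguous, near-equal ranges."""
    edges = np.linspace(0, n_grad, n_workers + 1).astype(int)
    return list(zip(edges[:-1], edges[1:]))


def select_from_partition(grad, lo, hi, threshold):
    """Global indices in [lo, hi) whose gradient magnitude reaches the threshold."""
    return np.flatnonzero(np.abs(grad[lo:hi]) >= threshold) + lo


def adjust_threshold(threshold, n_selected, n_target, gain=0.1):
    """Assumed update rule (not taken from the source): raise the threshold when
    too many gradients were selected, lower it when too few were."""
    ratio = max(n_selected, 1) / n_target
    return threshold * ratio ** gain


def simulate(n_grad=10_000, n_workers=4, density=0.01, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    bounds = partition_bounds(n_grad, n_workers)
    n_target = int(round(density * n_grad))  # user-requested number of gradients
    threshold = 1.0  # arbitrary starting guess
    history = []
    for _ in range(n_iters):
        # Each worker holds its own full-length local gradient (synthetic here).
        local_grads = rng.standard_normal((n_workers, n_grad))
        # Worker w searches only its own partition, so index sets cannot overlap.
        picks = [select_from_partition(local_grads[w], lo, hi, threshold)
                 for w, (lo, hi) in enumerate(bounds)]
        indices = np.concatenate(picks)
        # Stand-in for an all-reduce: average all workers' values at those indices.
        values = local_grads[:, indices].mean(axis=0)
        history.append((indices, values, threshold))
        # Shrink the gap between the actual and the requested selection count.
        threshold = adjust_threshold(threshold, indices.size, n_target)
    return history, n_target
```

Calling `simulate()` returns the per-iteration selected indices, averaged values, and thresholds. Because every worker scans only its own slice, the per-worker selection cost falls as workers are added, and the concatenated index set contains no duplicates by construction. The small `gain` is a design choice for this sketch: with Gaussian-tailed gradients the selection count reacts sharply to the threshold, so a larger step would make the count oscillate around the target instead of settling.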