To accelerate distributed training, many gradient compression methods have been proposed to alleviate the communication bottleneck in synchronous stochastic gradient descent (S-SGD), but their efficacy in real-world applications remains unclear. In this work, we first evaluate the efficiency of three representative compression methods (quantization with Sign-SGD, sparsification with Top-k SGD, and low-rank decomposition with Power-SGD) on a 32-GPU cluster. The results show that these methods do not always outperform well-optimized S-SGD, and can even be slower, because they are incompatible with three key system optimization techniques in S-SGD (all-reduce, pipelining, and tensor fusion). To address this, we propose a novel gradient compression method, called alternate compressed Power-SGD (ACP-SGD), which alternately compresses and communicates low-rank matrices. ACP-SGD not only significantly reduces the communication volume, but also supports the same three system optimizations as S-SGD. Compared with Power-SGD, the optimized ACP-SGD substantially reduces the compression and communication overheads while achieving similar model accuracy. In our experiments, ACP-SGD achieves average speedups of 4.06x over S-SGD and 1.43x over Power-SGD, and it consistently outperforms the other baselines across different setups (from 8 GPUs to 64 GPUs and from 1 Gb/s Ethernet to 100 Gb/s InfiniBand).
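
As a rough illustration of why alternating the low-rank factors can lower communication volume, the sketch below counts the elements exchanged per iteration for a single n x m gradient matrix. It rests on assumptions not stated in the abstract: that Power-SGD communicates both rank-r factors P (n x r) and Q (m x r) in every iteration, and that an alternating scheme communicates only one of them per iteration, switching between P and Q. All function names are hypothetical, and this is a back-of-the-envelope model, not the authors' implementation.

```python
def dense_volume(n, m):
    """Elements sent per iteration when the full n x m gradient is all-reduced."""
    return n * m


def power_sgd_volume(n, m, rank):
    """Elements sent per iteration when both factors P (n x r) and Q (m x r) are sent."""
    return (n + m) * rank


def alternating_volume(n, m, rank, step):
    """Elements sent at a given step when only one factor is sent per iteration.

    Assumption: even steps send P (n x r), odd steps send Q (m x r).
    """
    return n * rank if step % 2 == 0 else m * rank


def average_alternating_volume(n, m, rank, steps):
    """Mean elements sent per iteration over `steps` alternating iterations."""
    total = sum(alternating_volume(n, m, rank, t) for t in range(steps))
    return total / steps


if __name__ == "__main__":
    n, m, rank = 1024, 4096, 4  # hypothetical layer shape and rank
    print("dense:      ", dense_volume(n, m))
    print("both factors:", power_sgd_volume(n, m, rank))
    print("alternating: ", average_alternating_volume(n, m, rank, steps=2))
```

Under these assumptions, the alternating scheme sends on average half as many elements per iteration as sending both factors. This count covers communication volume only and says nothing about compression cost, accuracy, or the measured speedups reported above.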