Gradient aggregation has long been identified as a major bottleneck in today's large-scale distributed machine learning training systems. One promising way to mitigate this bottleneck is gradient compression, which directly reduces the volume of gradient data communicated. In practice, however, many gradient compression schemes fail to accelerate training while also preserving accuracy. In this work, we identify several common issues in previous gradient compression systems and evaluation methods: excessive computational overhead; incompatibility with all-reduce; and inappropriate evaluation practices, such as not using an end-to-end metric or comparing against a 32-bit baseline instead of a 16-bit baseline. We propose several general design and evaluation techniques to address these issues and provide guidelines for future work. Our preliminary evaluation shows that our techniques improve system performance and give a clearer picture of the end-to-end utility of gradient compression methods.
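To make the baseline issue concrete, the following minimal sketch computes how much a compression scheme reduces communicated gradient volume relative to a 32-bit and a 16-bit baseline. The parameter count and the 8-bit scheme are hypothetical values chosen for illustration, not figures from this work, and the calculation ignores metadata and the computational cost of compression itself.

```python
def bytes_per_step(num_params: int, bits_per_value: float) -> float:
    """Gradient bytes sent per training step at a given precision."""
    return num_params * bits_per_value / 8


def reduction_vs_baseline(num_params: int, scheme_bits: float,
                          baseline_bits: float) -> float:
    """Factor by which the scheme shrinks traffic relative to the baseline."""
    return (bytes_per_step(num_params, baseline_bits)
            / bytes_per_step(num_params, scheme_bits))


# Hypothetical setup: a 100M-parameter model and an 8-bit quantizer.
NUM_PARAMS = 100_000_000
SCHEME_BITS = 8

vs_fp32 = reduction_vs_baseline(NUM_PARAMS, SCHEME_BITS, 32)
vs_fp16 = reduction_vs_baseline(NUM_PARAMS, SCHEME_BITS, 16)

print(f"reduction vs 32-bit baseline: {vs_fp32:.1f}x")  # 4.0x
print(f"reduction vs 16-bit baseline: {vs_fp16:.1f}x")  # 2.0x
```

Under these assumptions the same scheme looks twice as effective against a 32-bit baseline as against a 16-bit one, which is why the choice of baseline can change the apparent benefit. Traffic reduction alone is still not an end-to-end metric, since it says nothing about time spent compressing or about final accuracy.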