Gradient compression alleviates expensive communication in distributed deep learning by sending fewer values and their corresponding indices, typically via Allgather (AG). Training with a high compression ratio (CR), i.e., retaining a larger share of gradient values, achieves accuracy close to DenseSGD but scales poorly because of its high communication cost (low parallel efficiency). Using a lower CR improves parallel efficiency by lowering synchronization cost, but also degrades model accuracy (low statistical efficiency). Further, the speedup attained with different models and CRs varies with network latency, effective bandwidth, and the collective operation used for aggregation. In many cases, collectives like Allreduce (AR) have lower cost than AG for exchanging the same amount of data. In this paper, we propose an AR-compatible Topk compressor that is bandwidth-optimal and thus performs better than AG-based exchange in certain network configurations. We develop a flexible communication strategy that switches between AG and AR based on which collective is optimal under the current settings, and model the Pareto relationship between parallel and statistical efficiency as a multi-objective optimization (MOO) problem to dynamically adjust CR and accelerate training while still converging to high accuracy.
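To make the AG-versus-AR trade-off concrete, the sketch below estimates both collectives with the textbook alpha-beta (latency-bandwidth) model for ring algorithms and picks the cheaper one. It is an illustration under stated assumptions, not the cost model or switching rule used in this paper: the function names, the 4-byte element size, and the premise that an AR-compatible compressor reduces only k values over an index set shared by all workers (ignoring the cost of agreeing on that set) are all hypothetical.

```python
def allgather_cost(p, msg_bytes, alpha, beta):
    """Ring Allgather estimate: each of p workers contributes msg_bytes."""
    return (p - 1) * alpha + (p - 1) * msg_bytes * beta


def allreduce_cost(p, msg_bytes, alpha, beta):
    """Ring Allreduce estimate (reduce-scatter followed by allgather)."""
    return 2 * (p - 1) * alpha + 2 * (p - 1) / p * msg_bytes * beta


def pick_collective(p, num_params, cr, alpha, beta, bytes_per_elem=4):
    """Return (name, estimated_seconds) of the cheaper collective.

    p          : number of workers
    num_params : gradient size in elements
    cr         : fraction of gradient values retained (assumed meaning of CR)
    alpha      : per-message latency in seconds
    beta       : transfer time per byte in seconds (1 / bandwidth)
    """
    k = max(1, int(num_params * cr))
    # AG sends k values plus k indices per worker.
    ag_bytes = 2 * k * bytes_per_elem
    # Assumption: AR reduces k values over an index set shared by all workers.
    ar_bytes = k * bytes_per_elem
    ag = allgather_cost(p, ag_bytes, alpha, beta)
    ar = allreduce_cost(p, ar_bytes, alpha, beta)
    return ("AR", ar) if ar < ag else ("AG", ag)


if __name__ == "__main__":
    # Bandwidth-bound setting: large gradient, low latency.
    print(pick_collective(p=8, num_params=25_000_000, cr=0.01,
                          alpha=1e-5, beta=1e-9))
    # Latency-bound setting: tiny message, high latency.
    print(pick_collective(p=8, num_params=1_000, cr=0.01,
                          alpha=1e-3, beta=1e-9))
```

Under this simplified model, AR tends to win when the transfer is bandwidth-bound, because its bandwidth term does not grow with the number of workers, while AG tends to win when latency dominates, because a ring Allgather needs about half as many communication steps.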