Gradient compression alleviates expensive communication in distributed deep learning by sending fewer values along with their corresponding indices, typically via Allgather (AG). Training with a high compression ratio (CR), which here means communicating a larger share of the gradient values, achieves accuracy close to DenseSGD but scales poorly because of high communication cost (i.e., low parallel efficiency). Using lower CRs improves parallel efficiency by lowering synchronization cost, but also degrades model accuracy (i.e., statistical efficiency). Further, the speedup attained with different models and CRs varies with network latency, effective bandwidth, and the collective operation used for aggregation. In many cases, collectives like Allreduce (AR) cost less than AG to exchange the same amount of data. In this paper, we propose an AR-compatible Topk compressor that is bandwidth-optimal and thus performs better than AG in certain network configurations. We develop a flexible communication strategy that switches between AG and AR depending on which collective is optimal under the current settings, and we model the Pareto relationship between parallel and statistical efficiency as a multi-objective optimization (MOO) problem to dynamically adjust the CR and accelerate training while still converging to high accuracy.
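To make the AG/AR trade-off concrete, the following is a minimal sketch of how a collective might be chosen from an estimated communication cost. It is not the method proposed in the paper: it assumes the standard alpha-beta (latency-bandwidth) model with ring-based collectives, and it assumes an AR-compatible scheme in which all workers reduce k values at a shared index set, so that no indices travel with the payload. The names `Network`, `allgather_cost`, `allreduce_cost`, and `pick_collective` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Network:
    alpha: float  # per-message latency in seconds (assumed to be measured)
    beta: float   # transfer time per byte in seconds, i.e. 1 / bandwidth


def allgather_cost(net: Network, workers: int, k: int, bytes_per_elem: int = 4) -> float:
    """Ring Allgather estimate: each worker contributes k values plus k indices."""
    payload = 2 * k * bytes_per_elem
    return (workers - 1) * (net.alpha + payload * net.beta)


def allreduce_cost(net: Network, workers: int, k: int, bytes_per_elem: int = 4) -> float:
    """Ring Allreduce estimate over k values at indices shared by all workers.

    Assumption: the shared index set is already known to every worker,
    so the cost of agreeing on it is not modeled here.
    """
    payload = k * bytes_per_elem
    return 2 * (workers - 1) * (net.alpha + payload * net.beta / workers)


def pick_collective(net: Network, workers: int, k: int) -> tuple[str, float]:
    """Return the cheaper collective and its estimated time in seconds."""
    ag = allgather_cost(net, workers, k)
    ar = allreduce_cost(net, workers, k)
    return ("allreduce", ar) if ar < ag else ("allgather", ag)


if __name__ == "__main__":
    bandwidth_bound = Network(alpha=1e-5, beta=1e-9)  # low latency, about 1 GB/s
    latency_bound = Network(alpha=1e-3, beta=1e-9)    # high latency, same bandwidth
    print(pick_collective(bandwidth_bound, workers=8, k=1_000_000))
    print(pick_collective(latency_bound, workers=8, k=100))
```

Under these assumptions, with N workers and b bytes per element, AR is cheaper whenever alpha < 2·k·b·beta·(1 − 1/N): large payloads on low-latency links favor AR, while small payloads on high-latency links favor AG. A real system would re-estimate alpha and beta during training, since the abstract stresses that the best collective depends on current network conditions.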