Flexible Communication for Optimal Distributed Learning over Unpredictable Networks

Gradient compression alleviates expensive communication in distributed deep learning by sending fewer values and their corresponding indices, typically via Allgather (AG). Training with a high compression ratio (CR) achieves accuracy comparable to DenseSGD, but scales poorly in parallel because of high communication cost (i.e., low parallel efficiency). Using lower CRs improves parallel efficiency by lowering synchronization cost, but also degrades model accuracy (i.e., statistical efficiency). Further, the speedup attained with different models and CRs varies with network latency, effective bandwidth, and the collective operation used for aggregation. In many cases, collectives like Allreduce (AR) have lower cost than AG for exchanging the same amount of data. In this paper, we propose an AR-compatible Topk compressor that is bandwidth-optimal and thus performs better than AG in certain network configurations. We develop a flexible communication strategy that switches between AG and AR depending on which collective is optimal in the current setting, and model the Pareto relationship between parallel and statistical efficiency as a multi-objective optimization (MOO) problem to dynamically adjust the CR and accelerate training while still converging to high accuracy.
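The trade-off described above can be illustrated with a small sketch. It is not the paper's implementation: it treats CR as the fraction of gradient values kept (which is how the abstract appears to use the term), estimates collective cost with the textbook alpha-beta model for ring algorithms, and assumes that the AR path reduces only values at an index set shared by all workers, ignoring the cost of agreeing on that set. The function names `topk_compress`, `allgather_cost`, `allreduce_cost`, and `pick_collective` are hypothetical.

```python
import heapq


def topk_compress(grad, cr):
    """Keep roughly cr * len(grad) largest-magnitude entries (at least one).

    Returns (indices, values), with indices in ascending order.
    """
    k = max(1, round(cr * len(grad)))
    idx = heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i]))
    idx.sort()
    return idx, [grad[i] for i in idx]


def allgather_cost(p, nbytes, alpha, beta):
    """Alpha-beta estimate for a ring Allgather (assumed algorithm).

    p: workers, nbytes: bytes contributed per worker,
    alpha: per-message latency (s), beta: transfer time per byte (s/byte).
    """
    return (p - 1) * alpha + (p - 1) * nbytes * beta


def allreduce_cost(p, nbytes, alpha, beta):
    """Alpha-beta estimate for a ring Allreduce over an nbytes buffer."""
    return 2 * (p - 1) * alpha + 2 * (p - 1) / p * nbytes * beta


def pick_collective(p, k, alpha, beta, bytes_per_value=4, bytes_per_index=4):
    """Choose the cheaper collective for exchanging k Top-k entries.

    Assumption: AG ships values plus indices from every worker, while the
    AR path reduces only k values at an index set all workers already share.
    """
    ag = allgather_cost(p, k * (bytes_per_value + bytes_per_index), alpha, beta)
    ar = allreduce_cost(p, k * bytes_per_value, alpha, beta)
    return ("AR", ar) if ar < ag else ("AG", ag)


if __name__ == "__main__":
    print(topk_compress([0.1, -3.0, 0.5, 2.0], 0.5))
    # Small message on a high-latency link versus a large message on a slow link.
    print(pick_collective(p=8, k=1_000, alpha=1e-5, beta=1e-9)[0])
    print(pick_collective(p=8, k=1_000_000, alpha=1e-6, beta=1e-8)[0])
```

Under these assumptions the latency term dominates for small messages, so AG comes out ahead, while for large messages the bandwidth term dominates and AR is cheaper. This is the kind of network-dependent switch the abstract motivates.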

