Flexible Communication for Optimal Distributed Learning over Unpredictable Networks

Gradient compression alleviates expensive communication in distributed deep learning by sending fewer values and their corresponding indices, typically via Allgather (AG). Training with a high compression ratio (CR), that is, with more gradient values communicated, achieves accuracy comparable to DenseSGD, but scales poorly because of its high communication cost (low parallel efficiency). Using lower CRs improves parallel efficiency by lowering synchronization cost, but also degrades model accuracy (statistical efficiency). Further, the speedup attained with different models and CRs varies with network latency, effective bandwidth, and the collective operation used for aggregation. In many cases, collectives like Allreduce (AR) cost less than AG to exchange the same amount of data. In this paper, we propose an AR-compatible Topk compressor that is bandwidth-optimal and thus performs better than AG in certain network configurations. We develop a flexible communication strategy that switches between AG and AR based on which collective is optimal in the current settings, and model the Pareto relationship between parallel and statistical efficiency as a multi-objective optimization (MOO) problem to dynamically adjust CR and accelerate training while still converging to high accuracy.
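To make the AG-versus-AR trade-off concrete, here is a minimal Python sketch, not the paper's implementation. It assumes that CR is the fraction of gradient entries kept, that both collectives use ring algorithms under the standard latency-bandwidth (alpha-beta) cost model, and that the AR-compatible scheme reduces k values over an index set all workers already share, so the cost of agreeing on that set is ignored. The function names and parameters are hypothetical and chosen only for illustration.

```python
import heapq


def topk_compress(grad, cr):
    """Keep the k largest-magnitude entries of `grad`.

    `cr` is assumed to be the fraction of entries kept (higher CR means
    more data sent). Returns (indices, values), sorted by index.
    """
    k = max(1, int(cr * len(grad)))
    idx = heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i]))
    idx.sort()
    return idx, [grad[i] for i in idx]


def allgather_cost(p, k, alpha, beta, bytes_per_item=4):
    """Ring Allgather: (p - 1) steps, each forwarding one worker's payload
    of k values plus k indices. alpha is per-message latency in seconds,
    beta is transfer time per byte in seconds."""
    return (p - 1) * (alpha + 2 * k * bytes_per_item * beta)


def allreduce_cost(p, k, alpha, beta, bytes_per_item=4):
    """Ring Allreduce over k values: 2 * (p - 1) steps, each moving k / p
    values. Assumes the index set is already shared (not costed here)."""
    return 2 * (p - 1) * (alpha + (k / p) * bytes_per_item * beta)


def pick_collective(p, k, alpha, beta):
    """Return the cheaper collective under the assumed cost model."""
    ag = allgather_cost(p, k, alpha, beta)
    ar = allreduce_cost(p, k, alpha, beta)
    return "AG" if ag <= ar else "AR"


if __name__ == "__main__":
    print(topk_compress([0.1, -3.0, 0.5, 2.0], 0.5))
    # Latency-dominated network: fewer steps favour AG.
    print(pick_collective(p=8, k=1000, alpha=1e-3, beta=1e-9))
    # Bandwidth-dominated network: smaller per-step payload favours AR.
    print(pick_collective(p=8, k=100000, alpha=1e-6, beta=1e-8))
```

Under these assumptions AG needs half as many communication steps, so it tends to win when latency dominates, while AR moves far less data per step and tends to win when bandwidth is the bottleneck. That is the kind of condition a strategy that switches collectives would have to re-evaluate as the network changes.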

