Communication scheduling has been shown to accelerate distributed training by overlapping all-reduce communications with backpropagation computations, and it has been widely adopted in popular distributed deep learning frameworks. However, existing approaches have two fundamental problems: (1) each all-reduce operation incurs excessive startup latency that is proportional to the number of workers; (2) training performance remains sub-optimal because of the dependency and synchronization requirements of the feed-forward computation in the next iteration. We propose a novel scheduling algorithm, DeAR, that decouples the all-reduce primitive into two continuous operations, so that communication overlaps with both backpropagation and feed-forward computations without extra communication. We further design a practical tensor fusion algorithm to improve training performance. Experimental results with five popular models show that DeAR achieves up to 83% and 15% training speedup over state-of-the-art solutions on a 64-GPU cluster with 10Gb/s Ethernet and 100Gb/s InfiniBand interconnects, respectively.
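To make the decoupling idea concrete, the following is a minimal in-process sketch rather than the authors' implementation. The abstract does not name the two operations; the sketch assumes they are a reduce-scatter followed by an all-gather, which is the standard way to split an all-reduce without adding communication volume. The function names `reduce_scatter`, `all_gather`, and `all_reduce` are hypothetical helpers that simulate workers with plain Python lists and do not call any real communication library.

```python
from typing import List


def reduce_scatter(grads: List[List[float]]) -> List[List[float]]:
    """Simulated reduce-scatter: worker r ends up with the sum of shard r."""
    p = len(grads)            # number of simulated workers
    n = len(grads[0])         # gradient length on each worker
    assert n % p == 0, "sketch assumes the gradient splits evenly"
    shard = n // p
    return [
        [sum(g[r * shard + i] for g in grads) for i in range(shard)]
        for r in range(p)
    ]


def all_gather(shards: List[List[float]]) -> List[List[float]]:
    """Simulated all-gather: every worker receives all reduced shards."""
    full = [x for s in shards for x in s]
    return [list(full) for _ in shards]


def all_reduce(grads: List[List[float]]) -> List[List[float]]:
    """Reference all-reduce: every worker receives the element-wise sum."""
    return [[sum(col) for col in zip(*grads)] for _ in grads]


# Two simulated workers, each holding a four-element gradient.
grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]

# Assumed first half: could be issued during backpropagation.
partial = reduce_scatter(grads)

# Assumed second half: could be deferred into the next feed-forward pass.
result = all_gather(partial)

# The two halves together produce the same result as one all-reduce.
assert result == all_reduce(grads)
```

Under this assumption, the second half is only needed just before a layer's parameters are used in the next iteration, which is what would allow it to overlap with feed-forward computation.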