Communication scheduling, which overlaps all-reduce communications with backpropagation computations, has been shown to be effective in accelerating distributed training and has been commonly adopted in popular distributed deep learning frameworks. However, two fundamental problems remain: (1) each all-reduce operation incurs excessive startup latency that is proportional to the number of workers; (2) training performance is sub-optimal because of the dependency and synchronization requirement of the feed-forward computation in the next iteration. We propose a novel scheduling algorithm, DeAR, that decouples the all-reduce primitive into two continuous operations, which overlap with both backpropagation and feed-forward computations without extra communications. We further design a practical tensor fusion algorithm to improve the training performance. Experimental results with five popular models show that DeAR achieves up to 83% and 15% training speedup over the state-of-the-art solutions on a 64-GPU cluster with 10Gb/s Ethernet and 100Gb/s InfiniBand interconnects, respectively.
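The decoupling idea can be illustrated with a small single-process simulation. The abstract does not name the two operations, so this sketch assumes the standard decomposition of all-reduce into a reduce-scatter followed by an all-gather; the function names, the list-based "workers", and the shard layout are all hypothetical and are not taken from DeAR's implementation. It only shows that the two steps together give the same result as one all-reduce, which is why the split needs no extra communication.

```python
from typing import List

Grads = List[List[float]]  # one gradient vector per simulated worker


def reduce_scatter(grads: Grads) -> Grads:
    """Step 1 (assumed): worker r ends up with the summed values of shard r only."""
    p, n = len(grads), len(grads[0])
    assert n % p == 0, "sketch assumes the vector splits evenly across workers"
    shard = n // p
    return [
        [sum(g[i] for g in grads) for i in range(r * shard, (r + 1) * shard)]
        for r in range(p)
    ]


def all_gather(shards: Grads) -> Grads:
    """Step 2 (assumed): every worker collects all reduced shards."""
    full = [x for s in shards for x in s]
    return [list(full) for _ in shards]


def all_reduce(grads: Grads) -> Grads:
    """Reference: every worker receives the element-wise sum of all vectors."""
    total = [sum(g[i] for g in grads) for i in range(len(grads[0]))]
    return [list(total) for _ in grads]


if __name__ == "__main__":
    grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
    # The two decoupled steps reproduce the single all-reduce result.
    assert all_gather(reduce_scatter(grads)) == all_reduce(grads)
```

In a real training loop, the point of the split would be scheduling: under the assumption above, the first step could run while later layers are still in backpropagation, and the second step could be deferred until the matching layer is needed in the next iteration's feed-forward pass.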