Distributed training is the de facto standard for scaling up the training of Deep Neural Networks (DNNs) across multiple GPUs. Its performance bottleneck lies in the communication required for gradient synchronization. Recently, practitioners have observed sparsity in gradient tensors, which suggests the potential to reduce communication traffic and improve end-to-end training efficiency. Yet an optimal communication scheme that fully exploits this sparsity is still missing. This paper aims to address this gap. We first analyze the characteristics of sparse tensors in popular DNN models to understand the fundamentals of sparsity. We then systematically explore the design space of communication schemes for sparse tensors and find the optimal one. We also develop Zen, a gradient synchronization system that approximately realizes this optimal scheme for sparse tensors. We demonstrate that Zen achieves up to 5.09x speedup in communication time and up to 2.48x speedup in training throughput compared to state-of-the-art methods.
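To make the traffic-reduction argument concrete, the following is a minimal sketch of why sparsity matters for gradient synchronization. It is not Zen's communication scheme. It only illustrates a plain index-value encoding of non-zero gradient entries, and all names (`to_sparse`, `aggregate`, the 4-byte sizes for values and indices) are assumptions made for this illustration.

```python
from typing import Dict, List, Tuple

BYTES_PER_VALUE = 4  # assumption: float32 gradient values
BYTES_PER_INDEX = 4  # assumption: int32 indices

def to_sparse(dense: List[float]) -> List[Tuple[int, float]]:
    """Keep only the non-zero entries as (index, value) pairs."""
    return [(i, v) for i, v in enumerate(dense) if v != 0.0]

def dense_bytes(length: int) -> int:
    """Bytes needed to send every element of a dense tensor."""
    return length * BYTES_PER_VALUE

def sparse_bytes(pairs: List[Tuple[int, float]]) -> int:
    """Bytes needed to send only the (index, value) pairs."""
    return len(pairs) * (BYTES_PER_INDEX + BYTES_PER_VALUE)

def aggregate(workers: List[List[Tuple[int, float]]]) -> Dict[int, float]:
    """Sum sparse gradients from several workers, index by index."""
    total: Dict[int, float] = {}
    for pairs in workers:
        for i, v in pairs:
            total[i] = total.get(i, 0.0) + v
    return total

# Two hypothetical workers, each with a mostly-zero gradient of length 1000.
grad_a = [0.0] * 1000
grad_a[3] = 0.5
grad_a[700] = -1.25
grad_b = [0.0] * 1000
grad_b[3] = 0.25
grad_b[42] = 2.0

sparse_a = to_sparse(grad_a)
sparse_b = to_sparse(grad_b)
summed = aggregate([sparse_a, sparse_b])

print(dense_bytes(len(grad_a)), sparse_bytes(sparse_a))  # dense vs. sparse payload
print(sorted(summed.items()))
```

In this toy setting each worker sends 16 bytes instead of 4000. The aggregated result has three non-zero entries although each input has two, because the non-zero positions differ across workers. A sparse communication scheme has to account for this kind of growth.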