Parallel training of neural networks at scale is challenging because of significant communication overheads. Recently, deep learning researchers have developed a variety of pruning algorithms that can prune (i.e., set to zero) 80-90% of the parameters in a neural network, yielding sparse subnetworks that match the accuracy of the unpruned parent network. In this work, we propose a novel approach that exploits these sparse subnetworks to optimize memory utilization and communication in two popular algorithms for parallel deep learning, namely data parallelism and inter-layer parallelism. We integrate our approach into AxoNN, a highly scalable framework for parallel deep learning that relies on data and inter-layer parallelism, and demonstrate reductions in communication time and memory utilization. On 512 NVIDIA V100 GPUs, our optimizations reduce the memory consumption of a 2.7 billion parameter model by 74% and the total communication time by 40%, providing an overall speedup of 34% over AxoNN, 32% over DeepSpeed-3D, and 46% over Sputnik, a sparse matrix computation baseline.
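To make the memory argument concrete, the following is a minimal sketch of why a pruned parameter tensor can be stored and communicated more cheaply. It is not the implementation used in AxoNN: the helper names (`magnitude_prune`, `compress`, `decompress`, `storage_bytes`), the choice of magnitude pruning, and the index-plus-value layout with 2-byte values and 4-byte indices are all assumptions made for illustration.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of `weights` (illustrative only)."""
    num_pruned = int(len(weights) * sparsity)
    # Sort positions by ascending magnitude; the first num_pruned positions are dropped.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:num_pruned])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


def compress(dense):
    """Keep only the nonzero entries, as parallel lists of indices and values."""
    indices = [i for i, w in enumerate(dense) if w != 0.0]
    values = [dense[i] for i in indices]
    return indices, values


def decompress(indices, values, size):
    """Rebuild the dense tensor from its compressed form."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


def storage_bytes(num_nonzero, value_bytes=2, index_bytes=4):
    """Bytes needed for the compressed form (assumed: fp16 values, 32-bit indices)."""
    return num_nonzero * (value_bytes + index_bytes)


# Hypothetical toy parameters; a real model would hold billions of them.
weights = [0.05, -1.2, 0.3, 0.01, 2.5, -0.02, 0.7, -0.04, 0.09, -0.6]
sparse = magnitude_prune(weights, sparsity=0.8)
indices, values = compress(sparse)

dense_bytes = len(weights) * 2            # every parameter stored as fp16
sparse_bytes = storage_bytes(len(values))  # only the surviving parameters plus indices
print(indices, values, dense_bytes, sparse_bytes)
```

Under these assumed sizes, each surviving parameter costs 6 bytes instead of 2, so the index overhead offsets part of the saving: at 80% sparsity the toy tensor shrinks from 20 to 12 bytes rather than to 4. The same compressed form is what would be exchanged between GPUs in place of the dense tensor, which is where a reduction in communication volume would come from.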