Parallel training of neural networks at scale is challenging because of significant communication overheads. Recently, deep learning researchers have developed a variety of pruning algorithms that can prune (i.e., set to zero) 80-90% of the parameters in a neural network, yielding sparse subnetworks that match the accuracy of the unpruned parent network. In this work, we propose a novel approach that exploits these sparse subnetworks to reduce memory utilization and communication in two popular algorithms for parallel deep learning: data parallelism and inter-layer parallelism. We integrate our approach into AxoNN, a highly scalable framework for parallel deep learning that relies on data and inter-layer parallelism, and demonstrate the resulting reduction in communication time and memory utilization. On 512 NVIDIA V100 GPUs, our optimizations reduce the memory consumption of a 2.7 billion parameter model by 74% and the total communication time by 40%, providing an overall speedup of 34% over AxoNN, 32% over DeepSpeed-3D, and 46% over Sputnik, a sparse matrix computation baseline.
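To make the idea of pruning concrete, the following is a minimal sketch, not AxoNN's implementation. It assumes simple magnitude pruning and an index-plus-value storage layout; the helper names `magnitude_prune`, `densify`, and `storage_bytes`, as well as the 2-byte value and 4-byte index sizes, are hypothetical choices made only for illustration.

```python
def magnitude_prune(weights, sparsity):
    """Keep only the largest-magnitude weights (assumed pruning criterion).

    Returns (indices, values) for the surviving weights; every other
    position is implicitly zero.
    """
    n_keep = len(weights) - int(round(sparsity * len(weights)))
    # Rank positions by absolute value, largest first, and keep the top n_keep.
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    kept = sorted(ranked[:n_keep])
    return kept, [weights[i] for i in kept]


def densify(indices, values, size):
    """Rebuild the dense vector, with zeros at the pruned positions."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


def storage_bytes(n_params, sparsity, value_bytes=2, index_bytes=4):
    """Estimate dense vs. compact (index + value) storage in bytes.

    The byte sizes are illustrative assumptions, not figures from the paper.
    """
    n_keep = n_params - int(round(sparsity * n_params))
    dense = n_params * value_bytes
    compact = n_keep * (value_bytes + index_bytes)
    return dense, compact


weights = [0.5, -0.01, 0.02, -0.9, 0.03, 0.0, 0.7, -0.04, 0.001, 0.05]
indices, values = magnitude_prune(weights, sparsity=0.8)
print(indices, values)
print(densify(indices, values, len(weights)))
print(storage_bytes(1000, sparsity=0.9))
```

Under these assumptions, the saving depends on the per-element index overhead as well as on the sparsity level, which is why a compact layout only pays off when a large fraction of the parameters has been pruned. The same reasoning applies to any message that carries only the surviving parameters.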