In the last few years, the memory required to train state-of-the-art neural networks has far exceeded the DRAM capacity of modern hardware accelerators. This has necessitated the development of efficient algorithms to train these networks in parallel on large-scale GPU-based clusters. Since computation is relatively inexpensive on modern GPUs, designing and implementing highly efficient communication in these parallel training algorithms is critical for extracting maximum performance. This paper presents AxoNN, a parallel deep learning framework that exploits asynchrony and message-driven execution to schedule neural network operations on each GPU, thereby reducing GPU idle time and maximizing hardware efficiency. By using CPU memory as scratch space for periodically offloading data during training, AxoNN reduces GPU memory consumption by a factor of four. This allows us to increase the number of parameters per GPU fourfold, which reduces the amount of communication and increases performance by over 13%. When tested on large transformer models with 12-100 billion parameters on 48-384 NVIDIA Tesla V100 GPUs, AxoNN achieves a per-GPU throughput of 49.4-54.78% of theoretical peak and reduces training time by 22-37 days (a 15-25% speedup) compared to the state-of-the-art.
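To illustrate what message-driven execution means in this setting, the following is a minimal Python sketch. It is not AxoNN's implementation. It assumes a toy pipeline in which each stage is a thread with its own mailbox, and the names `stage_worker` and `run_pipeline` are hypothetical. No tensors or GPUs are involved: each stage simply handles whichever forward or backward message arrives next rather than following a fixed schedule.

```python
import queue
import threading


def stage_worker(stage_id, num_stages, inboxes, num_microbatches, log):
    """Hypothetical pipeline stage that reacts to messages in arrival order."""
    inbox = inboxes[stage_id]
    backward_done = 0
    while backward_done < num_microbatches:
        # Block until any message arrives; its kind decides the work to do.
        kind, microbatch = inbox.get()
        log.append((stage_id, kind, microbatch))
        if kind == "forward":
            if stage_id + 1 < num_stages:
                # Pass the (simulated) activation to the next stage.
                inboxes[stage_id + 1].put(("forward", microbatch))
            else:
                # Last stage: the loss is available, so start the backward pass.
                inbox.put(("backward", microbatch))
        else:
            backward_done += 1
            if stage_id > 0:
                # Pass the (simulated) gradient to the previous stage.
                inboxes[stage_id - 1].put(("backward", microbatch))


def run_pipeline(num_stages=3, num_microbatches=4):
    """Run the toy pipeline and return the list of processed messages."""
    inboxes = [queue.Queue() for _ in range(num_stages)]
    log = []
    threads = [
        threading.Thread(
            target=stage_worker,
            args=(stage, num_stages, inboxes, num_microbatches, log),
        )
        for stage in range(num_stages)
    ]
    for thread in threads:
        thread.start()
    # Inject all microbatches at the first stage; later stages are driven
    # purely by the messages they receive.
    for microbatch in range(num_microbatches):
        inboxes[0].put(("forward", microbatch))
    for thread in threads:
        thread.join()
    return log


if __name__ == "__main__":
    for entry in run_pipeline():
        print(entry)
```

The interleaving of forward and backward entries in the log can differ from run to run, which is the point of the sketch: the order of work is decided by message arrival, not by a precomputed schedule.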