Large communication costs are a critical bottleneck in training state-of-the-art neural networks on distributed systems. This paper introduces AxoNN, a novel four-dimensional (4D) approach for parallelizing tensor computations in deep learning, inspired by Agarwal's algorithm for matrix multiplication. AxoNN employs two key strategies to minimize communication overhead. First, we optimize communication by overlapping expensive collective operations (reduce-scatter, all-gather, all-reduce) with computation. Our experiments with a 20-billion parameter transformer model demonstrate that these optimizations deliver nearly 53\% improvement in performance. Second, we present an analytical model to assist users in identifying communication-minimizing configurations within the vast search space defined by our 4D algorithm. This model simplifies the tuning process for practitioners' specific training workloads. When training an 80-billion parameter model on 1024 GPUs of Perlmutter, AxoNN surpasses Megatron-LM, a state-of-the-art framework, by a significant 26\%. Additionally, it achieves 57\% of the theoretical peak FLOP/s.
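The communication-computation overlap mentioned above can be illustrated with a minimal sketch using PyTorch's asynchronous collectives; this is not AxoNN's actual implementation, and the function name, tensor shapes, and the matmul chosen as the overlapped work are illustrative assumptions.

```python
# Minimal sketch of overlapping a collective with computation via
# torch.distributed's async_op=True. Illustrative only, not AxoNN's code.
import torch
import torch.distributed as dist

def overlapped_all_gather_matmul(x_shard, weight, group=None):
    """All-gather sharded activations while an independent matmul runs."""
    world_size = dist.get_world_size(group)
    gathered = [torch.empty_like(x_shard) for _ in range(world_size)]

    # Launch the all-gather asynchronously; a work handle is returned
    # instead of blocking until the collective completes.
    handle = dist.all_gather(gathered, x_shard, group=group, async_op=True)

    # Perform communication-independent work (here, a local matmul) so the
    # GPU stays busy while the collective is in flight.
    local_out = x_shard @ weight

    # Block only at the point where the gathered activations are needed.
    handle.wait()
    full_x = torch.cat(gathered, dim=0)
    return local_out, full_x
```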