Transformer models have emerged as powerful solutions to a wide range of problems across many disciplines. However, the deployment of Transformer architectures is significantly hindered by their extensive computational and memory requirements, making efficient distributed training methods indispensable. Prior research has examined the performance bottlenecks of distributed training, aiming to identify these bottlenecks and suggest directions for optimization. However, such analyses often overlook three aspects unique to Transformer models: the specialized architecture, the dependency on various distributed strategies, and the need to balance computational and memory overhead. This paper aims to bridge this gap by offering a comprehensive examination of the performance bottlenecks inherent in distributed training of Transformer models, combining theoretical analysis with empirical investigation. We propose an analytical framework tailored to these unique aspects of Transformers, enabling a holistic evaluation of model architectures, distributed strategies, and resource consumption. Based on this framework, we conduct a comparative analysis of theoretical performance and then systematically explore how various distributed training strategies fare in real-world scenarios. Most of the experimental results are well explained by the outcomes derived from the analytical framework. Notably, our findings suggest an advantage of pipeline parallelism over data parallelism for Transformer models. Moreover, we shed light on some unexpected outcomes, such as the potential for increased total memory overhead due to suboptimal model partitioning within pipeline parallelism. Additionally, we underscore the significance of communication block size and waiting time for further enhancing performance.