In machine learning systems, hybrid model parallelism combining tensor parallelism (TP) and pipeline parallelism (PP) has become the dominant approach for distributed training of Large Language Models~(LLMs) and Multimodal LLMs (MLLMs). However, TP introduces significant collective communication overhead, while PP suffers from synchronization inefficiencies such as pipeline bubbles. Existing work addresses these challenges largely in isolation, focusing either on overlapping TP communication or on flexible PP scheduling to mitigate pipeline bubbles. In this paper, we propose a synergistic tensor and pipeline parallelism schedule that reduces both types of bubbles simultaneously. Our schedule decouples the forward and backward passes in PP into fine-grained computation units, which are then braided into a composite computation sequence. This compositional structure enables near-complete elimination of TP-related bubbles. Building on this structure, we further design the PP schedule to minimize PP bubbles. Experimental results show that our approach improves training throughput by up to 12% for LLMs and 16% for MLLMs compared with existing scheduling methods. Our source code is available at https://github.com/MICLAB-BUPT/STP.
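To make the "braiding" idea concrete, the following is a minimal, illustrative sketch of interleaving decoupled forward and backward fine-grained units from different micro-batches into one composite sequence, so that the TP collective of one unit can overlap with the compute of the next. The names (`Unit`, `braid_schedule`), the two micro-batches, and the unit granularity are assumptions for illustration only, not the implementation in the STP repository.

```python
# Illustrative sketch of braiding fine-grained forward/backward units
# (assumed structure; not the authors' implementation).
from dataclasses import dataclass
from typing import List


@dataclass
class Unit:
    micro_batch: int  # which micro-batch this unit belongs to
    kind: str         # "fwd" or "bwd"
    layer: int        # layer index the unit computes


def braid_schedule(fwd_units: List[Unit], bwd_units: List[Unit]) -> List[Unit]:
    """Interleave forward and backward units into a composite sequence.

    While the TP collective of one unit is in flight, the compute of the
    next (opposite-direction) unit can proceed, hiding TP communication.
    """
    schedule, i, j = [], 0, 0
    while i < len(fwd_units) or j < len(bwd_units):
        if i < len(fwd_units):
            schedule.append(fwd_units[i])
            i += 1
        if j < len(bwd_units):
            schedule.append(bwd_units[j])
            j += 1
    return schedule


if __name__ == "__main__":
    # Forward units of micro-batch 2 braided with backward units of micro-batch 1.
    fwd = [Unit(micro_batch=2, kind="fwd", layer=l) for l in range(4)]
    bwd = [Unit(micro_batch=1, kind="bwd", layer=l) for l in reversed(range(4))]
    for u in braid_schedule(fwd, bwd):
        print(u)
```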