Heterogeneous Parallelism for Multimodal Large Language Model Training

Yashaswi Karnati, Kamran Jafari, Akash Mehra, Li Ding, Pranav Prashant Thombre, Ali Roshan Ghias, Shifang Xu, Parth Mannan, Yu Yao, Hao Wu, Eric Harper, Ashwath Aithal, Nima Tajbakhsh

Foundation model training is becoming multimodal, from post-training pipelines to large-scale pretraining. As modality coverage broadens, context windows grow, and encoder and LLM scales diverge, a single LLM-centric TP/CP/PP/DP/EP layout increasingly limits throughput. This coupling forces encoders to inherit LLM-driven sharding and placement choices that can add communication, limit encoder parallelism, or constrain the LLM schedule; the mismatch is most pronounced at long contexts, where LLM context parallelism is needed for the fused multimodal sequence but encoder inputs remain bounded. We present heterogeneous parallelism for multimodal large language model training, an abstraction that lets modules in one end-to-end graph use independent layouts and rank placements, supporting colocated execution on shared GPUs and non-colocated execution on disjoint rank sets. The key challenge is preserving boundary tensor semantics across independent layouts: forward activations must be materialized for the destination layout, while backward gradients must be routed back to the source layout. We address this with boundary communicators that implement forward and backward layout transforms, plus scheduling extensions for both placement modes. We evaluate optimized homogeneous, colocated heterogeneous, and non-colocated heterogeneous configurations across multimodal workloads and GPU scales to characterize when added layout and placement freedom exposes a better operating point. Across this sweep, colocated heterogeneity improves TFLOPS/GPU by up to 49.3%, while non-colocated heterogeneity improves aggregate token throughput by up to 13.0% and TFLOPS/GPU by up to 9.6%. We validate loss convergence parity against homogeneous baselines and release the system as an open-source Megatron-LM extension.
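To make the boundary-tensor requirement concrete, the following is a minimal, single-process sketch of the idea, not the released Megatron-LM extension or its API. It assumes a toy setting in which a flat list of activations is sharded contiguously across an encoder group and must be re-sharded for an LLM group of a different size. The names `ToyBoundaryCommunicator`, `split_even`, and `gather` are hypothetical, and plain Python lists stand in for the collective communication a real system would use.

```python
from typing import List, Sequence

Shard = List[float]


def split_even(values: Sequence[float], parts: int) -> List[Shard]:
    """Split a flat sequence into `parts` contiguous, equally sized shards."""
    if parts <= 0 or len(values) % parts != 0:
        raise ValueError("length must be divisible by a positive shard count")
    size = len(values) // parts
    return [list(values[i * size:(i + 1) * size]) for i in range(parts)]


def gather(shards: Sequence[Shard]) -> Shard:
    """Concatenate shards in rank order (stand-in for an all-gather)."""
    return [v for shard in shards for v in shard]


class ToyBoundaryCommunicator:
    """Hypothetical boundary between two modules with different shard counts.

    forward:  source-layout activations -> destination-layout activations
    backward: destination-layout gradients -> source-layout gradients
    """

    def __init__(self, src_parts: int, dst_parts: int) -> None:
        self.src_parts = src_parts
        self.dst_parts = dst_parts

    def forward(self, src_shards: Sequence[Shard]) -> List[Shard]:
        if len(src_shards) != self.src_parts:
            raise ValueError("unexpected number of source shards")
        # Materialize the activations in the layout the destination expects.
        return split_even(gather(src_shards), self.dst_parts)

    def backward(self, dst_grad_shards: Sequence[Shard]) -> List[Shard]:
        if len(dst_grad_shards) != self.dst_parts:
            raise ValueError("unexpected number of destination shards")
        # Route gradients back so each source rank receives the gradient
        # for exactly the activations it produced.
        return split_even(gather(dst_grad_shards), self.src_parts)


def demo() -> None:
    # Assumed toy sizes: encoder sharded 4 ways, LLM sharded 2 ways.
    activations = [float(i) for i in range(8)]
    comm = ToyBoundaryCommunicator(src_parts=4, dst_parts=2)
    encoder_shards = split_even(activations, 4)
    llm_shards = comm.forward(encoder_shards)
    # Pretend the LLM's backward pass produced gradient = 2 * activation.
    llm_grads = [[2.0 * v for v in shard] for shard in llm_shards]
    print(llm_shards)
    print(comm.backward(llm_grads))


if __name__ == "__main__":
    demo()
```

In this sketch the backward transform is the inverse of the forward one, so a round trip through the boundary returns every value to the rank that owns it. A real boundary communicator would also have to handle mixed layout dimensions (for example tensor, context, and data parallelism at once) and both colocated and disjoint rank sets, which this toy omits.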
