Large multimodal models (LMMs) have demonstrated strong capabilities in both understanding and generation tasks across a range of modalities. While these models accept flexible combinations of input data, their training efficiency suffers from two major issues: pipeline stage imbalance caused by heterogeneous model architectures, and training data dynamicity stemming from the diversity of multimodal data. In this paper, we present DIP, a dynamic and modality-aware pipeline scheduling framework designed for LMM training. DIP tackles the challenge of dynamic imbalance via two key techniques: (1) separating the computations of different modalities into dedicated pipeline segments, so that workloads are balanced within each continuous set of stages; and (2) dynamically splitting input data into finer-grained, modality-specific sub-microbatches, so that workloads are balanced across these segments. By generating pipeline schedules asynchronously on idle CPU resources during training, DIP tailors stage execution to each input batch without stalling the training process. We validate DIP on a diverse set of five LMMs, ranging from 12B to 94B parameters and including vision-language and diffusion models. Experimental results show that our system achieves up to 97.3% higher throughput than state-of-the-art systems, demonstrating strong adaptability to fluctuating multimodal training workloads.
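To make the idea of modality-specific sub-microbatches concrete, the following is a minimal sketch and not DIP's actual algorithm. It assumes that per-sample cost can be approximated by token counts and uses a simple greedy, in-order grouping under a per-modality budget. The names `Sample`, `split_by_modality`, and `budget` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical per-sample workload description (not from the paper).
    sample_id: int
    image_tokens: int
    text_tokens: int


def split_by_modality(samples, budget):
    """Greedily group each modality's work into sub-microbatches.

    For every modality, samples are scanned in order and packed into a
    group until adding the next sample would exceed that modality's
    token budget. A single sample larger than the budget gets its own
    group. Returns {modality: [[sample_id, ...], ...]}.
    """
    plan = {}
    for modality in ("image", "text"):
        groups, current, load = [], [], 0
        for s in samples:
            cost = getattr(s, f"{modality}_tokens")
            if cost == 0:
                continue  # this sample has no work for this modality
            if current and load + cost > budget[modality]:
                groups.append(current)
                current, load = [], 0
            current.append(s.sample_id)
            load += cost
        if current:
            groups.append(current)
        plan[modality] = groups
    return plan


batch = [
    Sample(0, image_tokens=576, text_tokens=100),
    Sample(1, image_tokens=0, text_tokens=900),
    Sample(2, image_tokens=1152, text_tokens=50),
    Sample(3, image_tokens=576, text_tokens=400),
]
plan = split_by_modality(batch, budget={"image": 1200, "text": 1000})
print(plan)
```

In this toy batch the image workload is split into three groups while the text workload needs only two, which illustrates why the split is made per modality rather than per sample. A real scheduler would presumably use profiled compute cost rather than raw token counts.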