Multimodal LLM (MLLM) datasets are inherently heterogeneous and exhibit significant data variability. Although each modality varies independently, the modalities are entangled at the sample level, which makes it difficult to balance workloads across both modalities and batches. We present Entrain, a distributed MLLM training framework that addresses both heterogeneity and variability in multimodal training workloads. Entrain challenges the intuition that dynamic data variability requires dynamic model parallelism by shifting the profiling paradigm from individual samples (the micro level) to whole batches (the macro level). We prove that, under this paradigm, a single static model-parallel configuration suffices for optimal load balancing. At the micro level, Entrain introduces a hierarchical microbatch assignment algorithm that defers excess workload within each iteration to stabilize variability across microbatches. Evaluations show that Entrain reduces workload variability across microbatches by up to 10.6$\times$, improving end-to-end training throughput by up to 1.40$\times$ over existing baselines.
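To make the sample-level entanglement concrete, the following is a minimal sketch of balancing two per-modality workloads across microbatches. It is not Entrain's hierarchical assignment algorithm, which the abstract does not specify. It substitutes a plain greedy longest-processing-time heuristic, and the function names, the two-component cost tuple `(encoder_cost, llm_cost)`, and the cost values are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Cost = Tuple[float, float]  # hypothetical (encoder_cost, llm_cost) per sample


def assign_microbatches(costs: Sequence[Cost], num_microbatches: int) -> List[List[int]]:
    """Greedy LPT-style assignment of samples to microbatches.

    Each sample carries two entangled costs, so it must be placed as a unit.
    Samples are visited heaviest-first and placed in the microbatch whose
    larger per-modality load stays smallest after the placement.
    """
    if num_microbatches <= 0:
        raise ValueError("num_microbatches must be positive")

    bins: List[List[int]] = [[] for _ in range(num_microbatches)]
    loads = [[0.0, 0.0] for _ in range(num_microbatches)]

    # Heaviest samples first (stable sort keeps input order on ties).
    order = sorted(range(len(costs)), key=lambda i: -(costs[i][0] + costs[i][1]))

    for i in order:
        enc, llm = costs[i]
        best = min(
            range(num_microbatches),
            key=lambda b: (
                max(loads[b][0] + enc, loads[b][1] + llm),  # bottleneck modality
                loads[b][0] + loads[b][1],                  # tie-break: lighter bin
            ),
        )
        bins[best].append(i)
        loads[best][0] += enc
        loads[best][1] += llm
    return bins


def modality_loads(costs: Sequence[Cost], bins: List[List[int]]) -> List[Cost]:
    """Sum the per-modality cost of every microbatch."""
    return [
        (sum(costs[i][0] for i in b), sum(costs[i][1] for i in b))
        for b in bins
    ]


if __name__ == "__main__":
    # Two image-heavy and two text-heavy samples (made-up costs).
    sample_costs = [(8, 1), (1, 8), (7, 2), (2, 7)]
    bins = assign_microbatches(sample_costs, num_microbatches=2)
    print(bins, modality_loads(sample_costs, bins))
```

On this toy input, filling microbatches round-robin in input order would pair the two image-heavy samples together and the two text-heavy samples together, giving per-modality loads of (15, 3) and (3, 15). The greedy sketch instead mixes one of each kind per microbatch, so neither modality dominates either microbatch.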