Multimodal large language models (MLLMs), such as GPT-4o, are garnering significant attention. While exploring MLLM training, we identified Modality Composition Incoherence, a phenomenon in which the proportion of a given modality varies dramatically across training examples. This incoherence exacerbates mini-batch imbalances, causing uneven GPU utilization across Data Parallel (DP) instances and severely degrading the efficiency and scalability of MLLM training, which ultimately slows training and hinders further research on MLLMs. To address these challenges, we introduce OrchMLLM, a comprehensive framework designed to mitigate the inefficiencies in MLLM training caused by Modality Composition Incoherence. First, we propose the Batch Post-Balancing Dispatcher, a technique that efficiently eliminates mini-batch imbalances in sequential data. Additionally, we integrate the MLLM Global Orchestrator into the training framework to orchestrate multimodal data and tackle the issues arising from Modality Composition Incoherence. We evaluate OrchMLLM across various MLLM sizes, demonstrating its efficiency and scalability. Experimental results show that OrchMLLM achieves a Model FLOPs Utilization (MFU) of $41.6\%$ when training an 84B MLLM with three modalities on $2560$ H100 GPUs, outperforming Megatron-LM by up to $3.1\times$ in throughput.
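To make the imbalance problem concrete, the sketch below illustrates one simple way a post-balancing step could redistribute the sequences of a global batch across DP ranks so that per-rank workloads (approximated here by token counts) are roughly even. This is a minimal greedy illustration under our own assumptions, not the paper's actual dispatcher; the names `Sample` and `rebalance` are hypothetical.

```python
# Illustrative sketch only: greedy "longest-first" post-balancing of a global
# batch across DP ranks, using total token count as a proxy for compute cost.
from dataclasses import dataclass
import heapq


@dataclass
class Sample:
    sample_id: int
    tokens: int  # total sequence length across modalities (proxy for compute cost)


def rebalance(samples: list[Sample], num_dp_ranks: int) -> list[list[Sample]]:
    """Assign each sample to the currently least-loaded DP rank, heaviest first,
    so the maximum per-rank load stays close to the average."""
    # Min-heap of (current_token_load, rank_index).
    heap = [(0, rank) for rank in range(num_dp_ranks)]
    heapq.heapify(heap)
    buckets: list[list[Sample]] = [[] for _ in range(num_dp_ranks)]
    for sample in sorted(samples, key=lambda s: s.tokens, reverse=True):
        load, rank = heapq.heappop(heap)
        buckets[rank].append(sample)
        heapq.heappush(heap, (load + sample.tokens, rank))
    return buckets


if __name__ == "__main__":
    # Skewed modality composition: some examples are dominated by long
    # image/audio token sequences, others are short and text-only.
    batch = [Sample(i, t) for i, t in enumerate([4096, 3800, 3900, 512, 480, 450, 300, 260])]
    for rank, bucket in enumerate(rebalance(batch, num_dp_ranks=4)):
        print(f"rank {rank}: {sum(s.tokens for s in bucket)} tokens")
```

A naive sequential split of such a batch would leave some DP ranks with several long multimodal sequences and others with only short text, which is the kind of imbalance the Batch Post-Balancing Dispatcher is designed to eliminate.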