Multimodal large language models (MLLMs) extend the capabilities of large language models (LLMs) by combining heterogeneous model architectures to handle diverse modalities such as images and audio. However, this inherent heterogeneity in MLLM model structure and data types makes makeshift extensions of existing LLM training frameworks unsuitable for efficient MLLM training. The few works that have attempted to address heterogeneity in MLLM training consider the characteristics of MLLMs only superficially. In this paper, we present Cornstarch, an efficient distributed MLLM training framework that accounts for the unique characteristics of MLLMs in both model and data parallelization. Cornstarch introduces frozen-aware pipeline parallelism and token workload-balanced context parallelism to improve MLLM training throughput. Our extensive evaluation shows that Cornstarch outperforms state-of-the-art solutions by $2.26\times$ on average in MLLM training throughput. Cornstarch is an open-source project available at https://github.com/cornstarch-org/Cornstarch.