Multimodal large language models (MLLMs) have extended the success of large language models (LLMs) to multiple data types, such as image, text, and audio, achieving strong performance in various domains, including multimodal translation, visual question answering, and content generation. Nonetheless, existing systems are inefficient at training MLLMs due to substantial GPU bubbles caused by the heterogeneous modality models and complex data dependencies in 3D parallelism. This paper proposes Optimus, a distributed MLLM training system that reduces end-to-end MLLM training time. Optimus is based on our principled analysis that scheduling the encoder computation within the LLM bubbles can reduce bubbles in MLLM training. To make scheduling encoder computation possible on all GPUs, Optimus searches for separate parallel plans for the encoder and the LLM, and adopts a bubble scheduling algorithm that exploits LLM bubbles without breaking the original data dependencies in the MLLM architecture. We further decompose encoder layer computation into a series of kernels and analyze the common bubble pattern of 3D parallelism to carefully optimize sub-millisecond bubble scheduling, minimizing overall training time. Our experiments in a production cluster show that Optimus accelerates MLLM training by 20.5%-21.3% with the ViT-22B and GPT-175B models on 3072 GPUs compared to baselines.
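To make the core idea concrete, the following is a minimal toy sketch (not the paper's actual scheduler) of filling pipeline bubbles with decomposed encoder kernels: given a list of idle intervals on one LLM pipeline stage and a dependency-ordered list of encoder kernels with measured durations, kernels are greedily packed into bubbles without reordering. The function name, data classes, and all timing numbers below are hypothetical and for illustration only.

```python
# Toy illustration of bubble scheduling: pack encoder kernels into idle
# "bubble" intervals of an LLM pipeline stage, preserving kernel order.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Kernel:
    name: str
    duration_ms: float  # measured per-kernel runtime

def pack_kernels_into_bubbles(
    bubbles: List[Tuple[float, float]],  # (start_ms, end_ms) idle intervals
    kernels: List[Kernel],               # encoder kernels in dependency order
) -> List[Tuple[str, float, float]]:
    """Greedily assign kernels to bubbles in order, keeping the kernels'
    original (dependency) order. Returns (kernel_name, start, end) tuples."""
    schedule = []
    k = 0
    for start, end in bubbles:
        cursor = start
        # Fill this bubble while the next kernel still fits before it closes.
        while k < len(kernels) and cursor + kernels[k].duration_ms <= end:
            schedule.append((kernels[k].name, cursor, cursor + kernels[k].duration_ms))
            cursor += kernels[k].duration_ms
            k += 1
    return schedule

if __name__ == "__main__":
    # Hypothetical sub-millisecond bubbles from a pipeline-parallel schedule.
    bubbles = [(0.0, 0.8), (5.0, 5.6), (12.0, 13.5)]
    # An encoder layer decomposed into smaller kernels so they fit the bubbles.
    kernels = [Kernel("attn_qkv", 0.3), Kernel("attn_out", 0.2),
               Kernel("mlp_fc1", 0.4), Kernel("mlp_fc2", 0.4)]
    for name, s, e in pack_kernels_into_bubbles(bubbles, kernels):
        print(f"{name}: [{s:.1f} ms, {e:.1f} ms]")
```

The sketch only conveys why decomposing encoder layers into fine-grained kernels matters: smaller units of work are more likely to fit into the short, scattered bubbles left by 3D parallelism.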