Recent advancements in Multimodal Large Language Models (MLLMs) underscore the significance of scalable models and data for boosting performance, yet this often incurs substantial computational costs. Although the Mixture of Experts (MoE) architecture has been employed to efficiently scale large language and image-text models, these efforts typically involve fewer experts and limited modalities. To address this, our work presents a pioneering attempt to develop a unified MLLM with the MoE architecture, named Uni-MoE, which can handle a wide array of modalities. Specifically, it features modality-specific encoders with connectors for a unified multimodal representation. We also implement a sparse MoE architecture within the LLMs to enable efficient training and inference through modality-level data parallelism and expert-level model parallelism. To enhance multi-expert collaboration and generalization, we present a progressive training strategy: 1) cross-modality alignment using various connectors with different cross-modality data, 2) training modality-specific experts with cross-modality instruction data to activate experts' preferences, and 3) tuning the Uni-MoE framework with Low-Rank Adaptation (LoRA) on mixed multimodal instruction data. We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets. The extensive experimental results demonstrate Uni-MoE's principal advantage of significantly reducing performance bias in handling mixed multimodal datasets, alongside improved multi-expert collaboration and generalization. Our findings highlight the substantial potential of MoE frameworks in advancing MLLMs, and the code is available at https://github.com/HITsz-TMG/UMOE-Scaling-Unified-Multimodal-LLMs.
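The sparse MoE idea described above can be illustrated with a minimal, dependency-free sketch: a learned router scores each expert for an input token, only the top-k experts are activated, and their outputs are combined with the renormalized gate weights. All weights, dimensions, and function names below are hypothetical toy values for illustration, not Uni-MoE's actual parameters or API.

```python
import math
import random

random.seed(0)

DIM, NUM_EXPERTS, TOP_K = 4, 3, 2  # toy sizes, chosen for illustration

# Hypothetical toy weights: each "expert" is a small linear map; the router
# is a linear gate producing one score per expert.
experts = [[[random.gauss(0.0, 0.5) for _ in range(DIM)] for _ in range(DIM)]
           for _ in range(NUM_EXPERTS)]
router = [[random.gauss(0.0, 0.5) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

def matvec(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_moe(x):
    """Route a token to its top-k experts; only those experts run."""
    gates = softmax(matvec(router, x))             # one gate weight per expert
    chosen = sorted(range(NUM_EXPERTS), key=lambda i: gates[i],
                    reverse=True)[:TOP_K]          # sparse selection
    norm = sum(gates[i] for i in chosen)           # renormalize over chosen experts
    out = [0.0] * DIM
    for i in chosen:                               # unchosen experts are skipped
        y = matvec(experts[i], x)
        out = [o + (gates[i] / norm) * yi for o, yi in zip(out, y)]
    return out, chosen

token = [1.0, -0.5, 0.3, 0.8]
output, chosen = sparse_moe(token)
print(chosen)  # indices of the activated experts
```

In a full MLLM the same routing would be applied inside each MoE layer of the transformer, and the modality-level data parallelism and expert-level model parallelism mentioned in the abstract place different experts (and different modality batches) on different devices; this sketch only shows the per-token gating logic.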