Multimodal large language models (MLLMs) have demonstrated impressive capabilities across various vision-language (VL) tasks. However, a generalist MLLM typically underperforms specialist MLLMs on most VL tasks, which can be attributed to task interference. In this paper, we propose a mixture of multimodal experts (MoME) to mitigate task interference and obtain a generalist MLLM. Our MoME is composed of two key components: a mixture of vision experts (MoVE) and a mixture of language experts (MoLE). MoVE adaptively modulates the features transformed from various vision encoders and is highly compatible with different transformation architectures. MoLE incorporates sparsely gated experts into LLMs, achieving painless performance gains with roughly unchanged inference costs. In response to task interference, MoME specializes in both the vision and language modalities to adapt to task discrepancies. Extensive experiments show that MoME significantly improves the performance of generalist MLLMs across various VL tasks. The source code is released at https://github.com/JiuTian-VL/MoME
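To make the sparsely gated expert idea behind MoLE concrete, below is a minimal PyTorch sketch of a top-k routed mixture-of-experts feed-forward layer. The class and parameter names (`MoLELayer`, `num_experts`, `top_k`) and the expert/router design are illustrative assumptions for exposition, not the authors' implementation; see the released source code for the actual architecture.

```python
# Minimal sketch of a sparsely gated mixture-of-experts FFN layer,
# in the spirit of MoLE. Names and hyperparameters are assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoLELayer(nn.Module):
    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int = 4, top_k: int = 1):
        super().__init__()
        self.top_k = top_k
        # Router scores each expert per token.
        self.router = nn.Linear(hidden_dim, num_experts)
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_dim, ffn_dim),
                nn.GELU(),
                nn.Linear(ffn_dim, hidden_dim),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_dim)
        gate_logits = self.router(x)                           # (B, T, E)
        weights, indices = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                   # renormalize over the selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[..., k] == e                    # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out


# Usage: with top_k=1, each token activates a single expert, so the
# per-token compute stays close to that of one dense FFN while the
# layer's total capacity grows with num_experts.
layer = MoLELayer(hidden_dim=512, ffn_dim=2048, num_experts=4, top_k=1)
tokens = torch.randn(2, 16, 512)
print(layer(tokens).shape)  # torch.Size([2, 16, 512])
```

The sparsity of the routing is what keeps inference costs roughly unchanged: only the selected experts run for each token, so adding experts increases parameters without a proportional increase in per-token computation.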