Instruction finetuning on a variety of image-text instruction data is the key to obtaining a versatile Multimodal Large Language Model (MLLM), and different configurations of the instruction data can lead to finetuned models with different capabilities. However, we find that data conflicts are inevitable when mixing instruction data from distinct domains, which can cause performance drops on tasks of a specific domain. To address this issue, we propose an efficient Mixture of Experts (MoE) design, a sparse Mixture of LoRA Experts (MoLE), for instruction finetuning MLLMs. Within the Transformer layers, we extend the popular Low-Rank Adaptation (LoRA) method by creating a set of LoRA experts specifically for the MLP layer and routing each token to the top-1 expert based on a routing function, which allows adaptive choices for tokens from different domains. Since the LoRA experts are sparsely activated, the training and inference costs are kept roughly constant compared to the original LoRA method. We replace the plain LoRA of LLaVA-1.5 with our MoE design and name the resulting model LLaVA-MoLE. Extensive experiments show that LLaVA-MoLE effectively mitigates the data conflict issue when mixing multiple distinct instruction datasets with various configurations, and achieves consistent performance gains over the strong plain-LoRA baselines. Most importantly, on the mixed datasets, LLaVA-MoLE can even outperform the plain-LoRA baseline trained with twice the samples.
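The top-1 routing over LoRA experts described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `MoLELinear`, the linear router, the zero initialization of the up-projection, and the `alpha / rank` scaling are all assumptions, borrowed from common LoRA and MoE practice, since the abstract does not specify them. The sketch uses only the Python standard library and processes one token vector at a time.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; stands in for pretrained or initialized weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

class MoLELinear:
    """Frozen linear layer plus top-1 routed LoRA experts (hypothetical sketch)."""

    def __init__(self, d_in, d_out, num_experts=3, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        self.W = rand_matrix(d_out, d_in, rng)              # frozen base weight
        # Assumption: a linear router with one score row per expert.
        self.router = rand_matrix(num_experts, d_in, rng)
        # Expert k holds a down-projection A[k] (rank x d_in) and an
        # up-projection B[k] (d_out x rank). B starts at zero, as in standard
        # LoRA, so the layer initially matches the frozen base layer.
        self.A = [rand_matrix(rank, d_in, rng) for _ in range(num_experts)]
        self.B = [[[0.0] * rank for _ in range(d_out)] for _ in range(num_experts)]
        self.scale = alpha / rank

    def route(self, x):
        """Return the index of the highest-scoring expert for token x."""
        scores = matvec(self.router, x)
        return max(range(len(scores)), key=scores.__getitem__)

    def forward(self, x):
        """Apply the base layer plus the single selected LoRA expert."""
        k = self.route(x)
        base = matvec(self.W, x)
        # Only expert k is evaluated, so the per-token cost is that of one LoRA.
        delta = matvec(self.B[k], matvec(self.A[k], x))
        return [b + self.scale * d for b, d in zip(base, delta)], k

# Usage: route a few random "tokens" and report which expert each one selects.
layer = MoLELinear(d_in=4, d_out=3)
token_rng = random.Random(42)
for _ in range(3):
    token = [token_rng.gauss(0.0, 1.0) for _ in range(4)]
    output, expert = layer.forward(token)
    print(f"expert={expert} output={[round(v, 4) for v in output]}")
```

Because each token evaluates exactly one expert, the added compute per token matches a single LoRA module regardless of how many experts exist; only the router's score computation grows with the expert count.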