Instruction tuning of Large Vision-language Models (LVLMs) has revolutionized the development of versatile models with zero-shot generalization across a wide range of downstream vision-language tasks. However, the diversity of training tasks from different sources and in different formats inevitably leads to task conflicts, where different tasks compete for the same set of model parameters, resulting in sub-optimal instruction-following abilities. To address this, we propose the Mixture of Cluster-conditional LoRA Experts (MoCLE), a novel Mixture of Experts (MoE) architecture designed to activate task-customized model parameters based on instruction clusters. A separate universal expert is further incorporated to improve the generalization capabilities of MoCLE on novel instructions. Extensive experiments on 11 zero-shot tasks demonstrate the effectiveness of MoCLE.
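To make the idea concrete, the following is a minimal sketch of one cluster-conditional LoRA layer with a universal expert. It is not the reference implementation: the abstract does not say how instruction clusters are formed or how expert outputs are combined, so the nearest-centroid assignment, the fixed mixing weight `mix`, and all names (`nearest_cluster`, `LoRAExpert`, `mocle_layer`) are assumptions chosen for illustration.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def nearest_cluster(embedding: Vector, centroids: List[Vector]) -> int:
    """Assumed routing rule: pick the centroid closest to the instruction embedding."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda k: sq_dist(embedding, centroids[k]))


class LoRAExpert:
    """Low-rank update: x -> up @ (down @ x), with down (r x d_in) and up (d_out x r)."""

    def __init__(self, down: Matrix, up: Matrix) -> None:
        self.down = down
        self.up = up

    def __call__(self, x: Vector) -> Vector:
        return matvec(self.up, matvec(self.down, x))


def mocle_layer(
    x: Vector,
    frozen_weight: Matrix,
    cluster_experts: List[LoRAExpert],
    universal_expert: LoRAExpert,
    cluster_id: int,
    mix: float = 0.5,  # hypothetical fixed weight; a learned gate could replace it
) -> Vector:
    """Frozen base output plus a blend of the cluster's expert and the universal expert."""
    base = matvec(frozen_weight, x)
    task = cluster_experts[cluster_id](x)   # only the routed expert is evaluated
    shared = universal_expert(x)            # always active, regardless of cluster
    return [b + mix * t + (1.0 - mix) * s for b, t, s in zip(base, task, shared)]


# Toy usage: two clusters, rank-1 experts, 2-dimensional hidden state.
centroids = [[0.0, 0.0], [10.0, 10.0]]
experts = [
    LoRAExpert(down=[[1.0, 0.0]], up=[[1.0], [0.0]]),
    LoRAExpert(down=[[0.0, 1.0]], up=[[0.0], [1.0]]),
]
universal = LoRAExpert(down=[[1.0, 1.0]], up=[[1.0], [1.0]])
identity = [[1.0, 0.0], [0.0, 1.0]]

cluster = nearest_cluster([9.0, 8.0], centroids)
output = mocle_layer([1.0, 2.0], identity, experts, universal, cluster)
```

In this sketch, instructions assigned to different clusters update the output through different low-rank parameters, which is one way to avoid the parameter competition described above, while the universal expert contributes to every input and so still applies when an instruction fits no cluster well.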