Instruction tuning of Large Vision-Language Models (LVLMs) has revolutionized the development of versatile models with zero-shot generalization across a wide range of downstream vision-language tasks. However, the diversity of training tasks drawn from different sources and formats inevitably leads to task conflicts, where different tasks compete for the same set of model parameters, resulting in sub-optimal instruction-following abilities. To address this, we propose the Mixture of Cluster-conditional LoRA Experts (MoCLE), a novel Mixture of Experts (MoE) architecture designed to activate task-customized model parameters based on instruction clusters. A separate universal expert is further incorporated to improve the generalization capabilities of MoCLE on novel instructions. Extensive experiments on 10 zero-shot tasks demonstrate the effectiveness of MoCLE.
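To make the routing idea concrete, the following is a minimal sketch of a single layer with cluster-conditional LoRA experts plus a universal expert. It is illustrative only and not the exact MoCLE formulation: the abstract does not specify how clusters are assigned or how expert outputs are combined, so the nearest-centroid assignment, the hard top-1 routing, the plain summation with the universal expert, and all names (`LoRAExpert`, `ClusterConditionalLayer`, `nearest_cluster`) are assumptions made for this example.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def vec_add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [p + q for p, q in zip(a, b)]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix, used here only as a stand-in for trained weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class LoRAExpert:
    """Low-rank update delta(x) = B (A x), with A: rank x d_in and B: d_out x rank."""

    def __init__(self, d_in, d_out, rank, rng):
        self.A = rand_matrix(rank, d_in, rng)
        # LoRA usually initializes B to zero; random values are used here
        # so that the experts are distinguishable in this demo.
        self.B = rand_matrix(d_out, rank, rng)

    def __call__(self, x):
        return matvec(self.B, matvec(self.A, x))


def nearest_cluster(embedding, centroids):
    """Assumed routing rule: index of the closest centroid (squared Euclidean)."""
    dists = [sum((e - c) ** 2 for e, c in zip(embedding, cen)) for cen in centroids]
    return dists.index(min(dists))


class ClusterConditionalLayer:
    """Frozen base weight + one LoRA expert per cluster + one universal expert."""

    def __init__(self, d_in, d_out, rank, num_clusters, seed=0):
        rng = random.Random(seed)
        self.W = rand_matrix(d_out, d_in, rng)  # stands in for the frozen base weight
        self.experts = [LoRAExpert(d_in, d_out, rank, rng) for _ in range(num_clusters)]
        self.universal = LoRAExpert(d_in, d_out, rank, rng)

    def forward(self, x, cluster_id):
        out = matvec(self.W, x)                          # base model path
        out = vec_add(out, self.experts[cluster_id](x))  # expert chosen by cluster
        out = vec_add(out, self.universal(x))            # universal expert, always active
        return out


# Usage: route an instruction embedding to its cluster, then run the layer.
centroids = [[0.0, 0.0], [1.0, 1.0]]           # hypothetical instruction-cluster centroids
layer = ClusterConditionalLayer(d_in=4, d_out=3, rank=2, num_clusters=2)
x = [1.0, 0.5, -0.5, 2.0]                      # hypothetical token hidden state
cid = nearest_cluster([0.9, 1.2], centroids)   # -> 1
y = layer.forward(x, cid)
```

In this sketch, inputs from different instruction clusters update through different low-rank parameters, which is the mechanism intended to reduce task conflicts, while the universal expert is shared by every input so that instructions far from any training cluster still receive a learned adaptation.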