Instruction tuning of Large Vision-Language Models (LVLMs) has revolutionized the development of versatile models with zero-shot generalization across a wide range of downstream vision-language tasks. However, the diversity of training tasks, which come from different sources and in different formats, inevitably leads to task conflicts, where different tasks compete for the same set of model parameters, resulting in sub-optimal instruction-following abilities. To address this, we propose the Mixture of Cluster-conditional LoRA Experts (MoCLE), a novel Mixture of Experts (MoE) architecture designed to activate task-customized model parameters based on instruction clusters. A separate universal expert is further incorporated to improve the generalization capabilities of MoCLE on novel instructions. Extensive experiments on 10 zero-shot tasks demonstrate the effectiveness of MoCLE.
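The routing idea described in the abstract can be illustrated with a minimal, pure-Python sketch. It is not the authors' implementation. The names `nearest_cluster`, `LoRAExpert`, and `ClusterConditionalLayer` are hypothetical, the nearest-centroid cluster assignment is an assumption, and the fixed scalar `gate` that mixes the cluster expert with the universal expert is an assumption standing in for whatever gating the full method uses.

```python
import random


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def nearest_cluster(embedding, centroids):
    """Index of the centroid closest to the instruction embedding.

    Assumption: instructions are assigned to clusters by squared
    Euclidean distance to precomputed centroids.
    """
    def dist(c):
        return sum((e - x) ** 2 for e, x in zip(embedding, c))
    return min(range(len(centroids)), key=lambda i: dist(centroids[i]))


class LoRAExpert:
    """Low-rank update delta(x) = B (A x), with A: r x d_in and B: d_out x r."""

    def __init__(self, d_in, d_out, rank, rng):
        self.A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        # B starts at zero, so a fresh expert leaves the base layer unchanged.
        self.B = [[0.0] * rank for _ in range(d_out)]

    def __call__(self, x):
        return matvec(self.B, matvec(self.A, x))


class ClusterConditionalLayer:
    """Frozen linear layer plus one LoRA expert per cluster and a universal expert."""

    def __init__(self, weight, num_clusters, rank=2, gate=0.5, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen base weights
        self.experts = [LoRAExpert(d_in, d_out, rank, rng) for _ in range(num_clusters)]
        self.universal = LoRAExpert(d_in, d_out, rank, rng)
        self.gate = gate  # hypothetical fixed mixing weight (assumption)

    def forward(self, x, cluster_id):
        base = matvec(self.weight, x)
        task = self.experts[cluster_id](x)   # cluster-specific parameters
        shared = self.universal(x)           # shared across all instructions
        return [b + self.gate * t + (1 - self.gate) * s
                for b, t, s in zip(base, task, shared)]


# Usage: route a toy instruction embedding to its cluster, then run the layer.
centroids = [[0.0, 0.0], [1.0, 1.0]]
layer = ClusterConditionalLayer(weight=[[1.0, 0.0], [0.0, 1.0]], num_clusters=2)
cluster_id = nearest_cluster([0.9, 0.8], centroids)
out = layer.forward([2.0, 3.0], cluster_id)
```

In this sketch only the expert selected by the instruction's cluster contributes task-specific parameters, so tasks in different clusters do not update the same low-rank weights, while the universal expert is applied to every input, including instructions that fit no cluster well.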