Visual encoders are fundamental components of vision-language models (VLMs), each exhibiting distinct strengths inherited from its pre-trained visual foundation model. To leverage these complementary capabilities, recent studies incorporate multiple encoders within a single VLM, at a considerable increase in computational cost. In this paper, we present Mixture-of-Visual-Encoder Knowledge Distillation (MoVE-KD), a novel framework that distills the unique proficiencies of multiple vision encoders into a single, efficient encoder. Specifically, to mitigate conflicts and retain the unique characteristics of each teacher encoder, we employ low-rank adaptation (LoRA) and a mixture-of-experts (MoE) mechanism to selectively activate specialized knowledge based on input features, enhancing both adaptability and efficiency. To regularize the KD process and enhance performance, we propose an attention-based distillation strategy that adaptively weights the different encoders and emphasizes valuable visual tokens, reducing the burden of replicating the comprehensive yet distinct features of multiple teachers. Comprehensive experiments on popular VLMs, such as LLaVA and LLaVA-NeXT, validate the effectiveness of our method. Our code is available at: https://github.com/hey-cjj/MoVE-KD.
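To make the two mechanisms concrete, the following is a minimal NumPy sketch of (a) a layer that adds router-gated, per-teacher LoRA updates to a frozen base projection, and (b) an attention-weighted per-token distillation loss. This is an illustration under our own assumptions, not the authors' implementation; all names, shapes, and initializations here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_experts, n_tokens = 8, 2, 3, 4  # illustrative sizes: feature dim, LoRA rank, teachers, tokens

W = rng.normal(size=(d, d)) * 0.1            # frozen base projection of the student encoder
A = rng.normal(size=(n_experts, r, d)) * 0.1 # LoRA down-projections, one expert per teacher
B = np.zeros((n_experts, d, r))              # LoRA up-projections (zero-init, standard for LoRA)
G = rng.normal(size=(n_experts, d)) * 0.1    # router that scores experts from the input feature

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def move_layer(x):
    """Base output plus an input-gated sum of per-teacher LoRA updates."""
    gates = softmax(G @ x)  # input-dependent expert weighting
    delta = sum(g * (B[i] @ (A[i] @ x)) for i, g in enumerate(gates))
    return W @ x + delta

def attn_weighted_distill_loss(student_tokens, teacher_tokens, cls_attn):
    """Per-token MSE, re-weighted so tokens the teacher attends to count more."""
    w = softmax(cls_attn)                                      # one weight per visual token
    per_token = ((student_tokens - teacher_tokens) ** 2).mean(axis=-1)
    return float((w * per_token).sum())

x = rng.normal(size=d)
y = move_layer(x)                                              # shape (d,)
loss = attn_weighted_distill_loss(
    rng.normal(size=(n_tokens, d)),                            # student token features
    rng.normal(size=(n_tokens, d)),                            # one teacher's token features
    rng.normal(size=n_tokens),                                 # teacher's [CLS] attention scores
)
```

With `B` zero-initialized, `move_layer` initially reproduces the frozen base projection exactly, so training only gradually injects teacher-specific behavior through the gated LoRA branches; in a multi-teacher setup, the token-level loss above would additionally be combined across teachers with per-encoder weights.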