Multi-modal content generation has recently attracted considerable attention, with researchers investigating visual instruction tuning built on large language models (LLMs). To improve the performance and generalization of such models, distilling knowledge from pretrained multi-modal models (teachers) into more compact multi-modal LLMs (students) has gained considerable interest. However, the prevailing instruction-tuning paradigm for knowledge distillation in multi-modal LLMs is resource-intensive and unidirectional, neglecting the potential for mutual feedback between the student and teacher models. We therefore propose a Competitive Multi-modal Distillation framework (CoMD), which captures bidirectional feedback between the teacher and student models and continually updates the multi-modal capabilities the student has learned. CoMD comprises two stages: multi-modal pre-training and multi-modal competitive distillation. The first stage pre-trains the student model on a large number of filtered multi-modal datasets. The second stage facilitates bidirectional knowledge transfer between the student and teacher models. Experiments on diverse datasets show that our knowledge transfer method consistently improves the capabilities of the student model. After four rounds of distillation, the 7B student model surpasses the current state-of-the-art model, LLaVA-13B, on the ScienceQA and LLaVA Test datasets, and also outperforms other strong baselines in the zero-shot setting.