Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models

Multi-modal content generation has recently attracted considerable research attention, much of it focused on visual instruction tuning built on large language models (LLMs). To improve the performance and generalization of such models, distilling knowledge from pretrained multi-modal models (teachers) into more compact multi-modal LLMs (students) has gained considerable interest. However, the prevailing instruction-tuning paradigm for knowledge distillation in multi-modal LLMs is resource-intensive and unidirectional, and it neglects the potential for mutual feedback between the student and teacher models. We therefore propose a Competitive Multi-modal Distillation framework (CoMD), which captures bidirectional feedback between the teacher and student models and continually updates the multi-modal capabilities the student model has learned. It comprises two stages: multi-modal pre-training and multi-modal competitive distillation. The first stage pre-trains the student model on a large number of filtered multi-modal datasets. The second stage facilitates bidirectional knowledge transfer between the student and teacher models. Our experiments on diverse datasets show that this knowledge transfer method consistently improves the capabilities of the student model. After four rounds of distillation, the 7B student model surpasses the current state-of-the-art LLaVA-13B on the ScienceQA and LLaVA Test datasets, and it also outperforms other strong baselines in the zero-shot setting.
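
For context, the unidirectional teacher-to-student transfer that the abstract contrasts with CoMD is commonly implemented as a temperature-softened KL divergence between the two models' output distributions. The sketch below illustrates that conventional baseline only; it is not the CoMD procedure, whose details the abstract does not give. The function names and the default temperature are assumptions chosen for illustration.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_sum for x in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    This is the standard one-way distillation objective (hypothetical
    helper, not taken from the paper). The T^2 factor keeps gradient
    magnitudes comparable across temperatures.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


# Usage: the loss is zero when the student matches the teacher exactly
# and positive otherwise.
teacher = [2.0, 0.5, -1.0]
matched = distillation_loss([2.0, 0.5, -1.0], teacher)
mismatched = distillation_loss([-1.0, 0.5, 2.0], teacher)
```

In this one-way setup the student only ever receives signal from the teacher; nothing about the student's errors flows back, which is the limitation the abstract highlights.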

