Multimodal Large Language Models (MLLMs) demonstrate impressive cross-modal capabilities, yet their substantial size poses significant deployment challenges. Knowledge distillation (KD) is a promising solution for compressing these models, but existing methods rely primarily on static next-token alignment, neglecting the dynamic token interactions that embed essential capabilities for multimodal understanding and generation. To address this, we introduce Align-TI, a novel KD framework designed from the perspective of Token Interactions. Our approach is motivated by the insight that MLLMs rely on two primary interactions: vision-instruction token interactions for extracting relevant visual information, and intra-response token interactions for coherent generation. Accordingly, Align-TI introduces two components. IVA enables the student model to imitate the teacher's instruction-relevant visual information extraction capability by aligning on salient visual regions, while TPA captures the teacher's dynamic generative logic by aligning the sequential token-to-token transition probabilities. Extensive experiments demonstrate Align-TI's superiority. Notably, our approach achieves a $2.6\%$ relative improvement over Vanilla KD, and our distilled Align-TI-2B even outperforms LLaVA-1.5-7B (a much larger MLLM) by $7.0\%$, establishing a new state-of-the-art distillation framework for training parameter-efficient MLLMs. Code is available at https://github.com/lchen1019/Align-TI.
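For readers unfamiliar with the baseline the abstract contrasts against, the "static next-token alignment" of Vanilla KD is typically a per-position KL divergence between the teacher's and student's next-token distributions. A minimal numpy sketch of that baseline loss follows; the function names and the temperature value are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last (vocabulary) axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def vanilla_kd_loss(teacher_logits, student_logits, temperature=2.0):
    """Static next-token alignment: mean KL(teacher || student)
    across response positions. Shapes: (seq_len, vocab_size)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    eps = 1e-12  # avoid log(0)
    kl = (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)
    return kl.mean()
```

This loss matches the student to the teacher independently at each position, which is exactly the "static" property Align-TI's TPA component is said to go beyond by additionally aligning token-to-token transitions.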