From Images to Words: Efficient Cross-Modal Knowledge Distillation to Language Models from Black-box Teachers

Knowledge distillation (KD) methods are pivotal in compressing large pre-trained language models into smaller ones, preserving computational efficiency without a significant drop in performance. Traditional KD techniques assume that the teacher (source) and student (target) models share the same modality. Existing multimodal knowledge distillation methods, on the other hand, require modality-specific pre-training of the teacher model, which is computationally infeasible in most cases. In this paper, we introduce ARMADA, an efficient cross-modal knowledge distillation framework designed to transfer knowledge from large vision-language models, including black-box models, to language-only models. Unlike existing KD techniques that rely on the internal structures of multimodal teachers or require computationally expensive pre-training, ARMADA leverages novel alignment techniques to distil knowledge without altering the teacher model, ensuring efficiency and scalability. We empirically validate ARMADA on twelve natural language understanding, eight complex generative reasoning, and five instruction-tuning tasks, demonstrating consistent performance improvements in large models such as DeBERTa-v2-1.4B, OPT-1.3B, and LLaMA-{3B, 7B, 8B}. ARMADA achieves up to a 3.4% improvement on language understanding tasks and a 2.6% improvement on generative reasoning, all without requiring expensive multimodal pre-training or fine-tuning of the teacher model. Our findings challenge conventional knowledge distillation paradigms by demonstrating that even vision-language models, despite lacking direct textual understanding, can significantly enhance language models when distilled appropriately.
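For readers unfamiliar with the distillation objective that this line of work builds on, the following is a minimal sketch of a generic output-level KD loss in the black-box setting, where the teacher exposes only output probabilities. It is not ARMADA's alignment technique, which the abstract does not specify. The function names `softmax` and `kd_loss`, the mixing weight `alpha`, and the example values are all illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_probs, label=None, alpha=0.5):
    """Generic distillation loss (illustrative, not the paper's method).

    Computes KL(teacher || student) from the teacher's output
    probabilities, which is all a black-box teacher provides.
    If a gold label index is given, mixes in cross-entropy:
    alpha * CE + (1 - alpha) * KL.
    """
    student_probs = softmax(student_logits)
    kl = sum(
        p * (math.log(p) - math.log(q))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0  # terms with p == 0 contribute nothing
    )
    if label is None:
        return kl
    ce = -math.log(student_probs[label])
    return alpha * ce + (1.0 - alpha) * kl

# Hypothetical values: a 3-class teacher distribution and student logits.
teacher = [0.7, 0.2, 0.1]
student = [1.0, 0.5, -0.5]
loss = kd_loss(student, teacher, label=0)
```

The KL term vanishes when the student reproduces the teacher's distribution exactly and is positive otherwise, so minimizing it pulls the student toward the teacher without any access to the teacher's weights or internal activations.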
