Knowledge distillation (KD) has become a prevalent technique for compressing large language models (LLMs). Existing KD methods are constrained by the need for identical tokenizers (i.e., vocabularies) between teacher and student models, limiting their versatility in handling LLMs of different architecture families. In this paper, we introduce Multi-Level Optimal Transport (MultiLevelOT), a novel approach that advances optimal transport for universal cross-tokenizer knowledge distillation. Our method aligns the logit distributions of the teacher and the student at both the token and sequence levels using diverse cost matrices, eliminating the need for dimensional or token-by-token correspondence. At the token level, MultiLevelOT integrates both global and local information by jointly optimizing all tokens within a sequence to enhance robustness. At the sequence level, we efficiently capture complex distribution structures of logits via the Sinkhorn distance, which approximates the Wasserstein distance as the divergence measure. Extensive experiments on tasks such as extractive QA, generative QA, and summarization demonstrate that MultiLevelOT outperforms state-of-the-art cross-tokenizer KD methods under various settings. Our approach is robust to different student and teacher models across model families, architectures, and parameter sizes.
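To make the sequence-level divergence concrete, the sketch below illustrates the entropy-regularized Sinkhorn iteration that approximates the Wasserstein distance between a teacher and a student distribution. It is a minimal illustration assuming PyTorch; the `sinkhorn_distance` function name, the placeholder cost matrix, and the random inputs are illustrative assumptions, not the paper's exact formulation or cost design.

```python
import torch

def sinkhorn_distance(p, q, cost, n_iters=50, eps=0.1):
    """Entropy-regularized OT (Sinkhorn) between two discrete distributions.

    p:    (n,) source distribution, e.g. softmax of teacher logits
    q:    (m,) target distribution, e.g. softmax of student logits
    cost: (n, m) cost matrix (illustrative placeholder here)
    eps:  entropic regularization strength
    """
    K = torch.exp(-cost / eps)                       # Gibbs kernel
    u = torch.ones_like(p)
    for _ in range(n_iters):                         # alternating scaling updates
        v = q / (K.t() @ u + 1e-9)
        u = p / (K @ v + 1e-9)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)       # approximate transport plan
    return (plan * cost).sum()                       # Sinkhorn distance

# Illustrative usage: the two distributions may have different sizes,
# so no dimensional correspondence between vocabularies is required.
teacher_probs = torch.softmax(torch.randn(8), dim=-1)
student_probs = torch.softmax(torch.randn(6), dim=-1)
cost = torch.rand(8, 6)                              # placeholder cost matrix
loss = sinkhorn_distance(teacher_probs, student_probs, cost)
```

Because the Sinkhorn iteration only multiplies by the fixed Gibbs kernel, it avoids solving the exact linear program for the Wasserstein distance, which is what makes sequence-level alignment tractable during training.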