Cross-tokenizer distillation (CTD), the transfer of knowledge from a teacher to a student language model when the two use different tokenizers, remains a largely unsolved problem. Existing approaches rely on heuristic strategies to align mismatched vocabularies, introducing considerable complexity. In this paper, we propose a simple but effective baseline, Byte-Level Distillation (BLD), which enables CTD by operating at an interface common to all tokenizers: the byte level. Specifically, we convert the teacher's output distribution to byte-level probabilities, attach a lightweight byte-level decoder head to the student, and distill through this shared byte-level interface. Despite its simplicity, BLD performs competitively with significantly more sophisticated CTD methods, and surpasses them on several benchmarks, across a range of distillation tasks with models from 1B to 8B parameters. Our results suggest that the byte level is a natural common ground for cross-tokenizer knowledge transfer, while also highlighting that consistent improvements across all tasks and benchmarks remain elusive, underscoring that CTD is still an open problem.
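To make the byte-level interface concrete, the following is a minimal PyTorch-style sketch of the general idea: a lightweight byte-level head on the student, teacher token probabilities marginalized onto bytes, and a KL distillation loss over the shared byte vocabulary. All names here (ByteDecoderHead, token_probs_to_byte_probs, the choice of 256 byte symbols and a fixed number of predicted bytes per step) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of byte-level distillation, assuming a PyTorch setup.
# Hypothetical names and shapes; not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

BYTE_VOCAB = 256  # raw bytes are the interface shared by all tokenizers


class ByteDecoderHead(nn.Module):
    """Lightweight head mapping student hidden states to per-byte logits."""

    def __init__(self, hidden_dim: int, bytes_per_step: int = 4):
        super().__init__()
        self.bytes_per_step = bytes_per_step
        self.proj = nn.Linear(hidden_dim, bytes_per_step * BYTE_VOCAB)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, hidden_dim)
        # returns: (batch, seq, bytes_per_step, BYTE_VOCAB)
        logits = self.proj(hidden)
        return logits.view(*hidden.shape[:-1], self.bytes_per_step, BYTE_VOCAB)


def token_probs_to_byte_probs(token_probs: torch.Tensor,
                              token_bytes: list[bytes],
                              position: int) -> torch.Tensor:
    """Marginalize a teacher next-token distribution onto the byte at `position`.

    token_probs: (vocab,) teacher probabilities over its own token vocabulary.
    token_bytes: the UTF-8 byte string of each teacher vocabulary entry.
    Tokens shorter than `position + 1` contribute no mass; the result is
    renormalized. (A simplification of the conversion described in the paper.)
    """
    byte_probs = torch.zeros(BYTE_VOCAB)
    for tok_id, b in enumerate(token_bytes):
        if len(b) > position:
            byte_probs[b[position]] += token_probs[tok_id]
    total = byte_probs.sum()
    return byte_probs / total if total > 0 else byte_probs


def byte_level_kd_loss(student_byte_logits: torch.Tensor,
                       teacher_byte_probs: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student) over the byte vocabulary, averaged over positions."""
    log_q = F.log_softmax(student_byte_logits, dim=-1)
    return F.kl_div(log_q, teacher_byte_probs, reduction="batchmean")
```

In this sketch the teacher and student never need to agree on tokenization: the teacher's distribution is projected onto byte values, the student predicts bytes through the added head, and the distillation loss is computed entirely in that shared space.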