Massively multilingual Transformers (MMTs), such as mBERT and XLM-R, are widely used for cross-lingual transfer learning. While these are pretrained to represent hundreds of languages, end users of NLP systems are often interested only in individual languages. For such purposes, the MMTs' language coverage makes them unnecessarily expensive to deploy in terms of model size, inference time, energy, and hardware cost. We thus propose to extract compressed, language-specific models from MMTs which retain the capacity of the original MMTs for cross-lingual transfer. This is achieved by distilling the MMT bilingually, i.e., using data from only the source and target language of interest. Specifically, we use a two-phase distillation approach, termed BiStil: (i) the first phase distils a general bilingual model from the MMT, while (ii) the second, task-specific phase sparsely fine-tunes the bilingual "student" model using a task-tuned variant of the original MMT as its "teacher". We evaluate this distillation technique in zero-shot cross-lingual transfer across a number of standard cross-lingual benchmarks. The key results indicate that the distilled models exhibit minimal degradation in target language performance relative to the base MMT despite being significantly smaller and faster. Furthermore, we find that they outperform multilingually distilled models such as DistilmBERT and MiniLMv2 while having a very modest training budget in comparison, even on a per-language basis. We also show that bilingual models distilled from MMTs greatly outperform bilingual models trained from scratch. Our code and models are available at https://github.com/AlanAnsell/bistil.
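The abstract does not state the objective used in either distillation phase. As a rough illustration of the teacher-student idea only, the sketch below computes a generic soft-label distillation loss: the KL divergence between temperature-softened teacher and student output distributions. This is an assumed, textbook-style formulation and not necessarily the BiStil objective; the function names `softmax` and `distillation_loss` and the `temperature` parameter are hypothetical and are not taken from the authors' repository.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions.

    Assumption: a generic soft-label objective, used here purely for
    illustration; the paper's actual loss may differ.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    kl = sum(
        p * (math.log(p) - math.log(q))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )
    # Scaling by T^2 is a common convention that keeps gradient
    # magnitudes comparable across temperatures.
    return kl * temperature ** 2


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]
    print(distillation_loss([1.5, 0.7, -0.8], teacher))  # small positive value
    print(distillation_loss(teacher, teacher))           # zero: identical outputs
```

In a two-phase setup like the one described, a loss of this kind could in principle be applied first with the general MMT as teacher and then with a task-tuned MMT as teacher; how the authors actually do this is documented in their paper and code rather than here.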