Compared to conventional bilingual translation systems, massively multilingual machine translation is appealing because a single model can translate into multiple languages and benefit from knowledge transfer for low-resource languages. On the other hand, massively multilingual models suffer from the curse of multilinguality unless their size is scaled up massively, which increases their training and inference costs. Sparse Mixture-of-Experts models are a way to drastically increase model capacity without a proportional increase in computation. The recently released NLLB-200 is an example of such a model. It covers 202 languages but requires at least four 32GB GPUs just for inference. In this work, we propose a pruning method that allows the removal of up to 80% of experts with a negligible loss in translation quality, which makes it feasible to run the model on a single 32GB GPU. Further analysis suggests that our pruning metrics make it possible to identify language-specific experts and to prune non-relevant experts for a given language pair.