The recently released NLLB-200 is a set of multilingual Neural Machine Translation models covering 202 languages. The largest model is based on a Mixture of Experts architecture and achieves state-of-the-art results across many language pairs. It contains 54.5B parameters and requires at least four 32GB GPUs for inference alone. In this work, we propose a pruning method that removes up to 80% of experts without further finetuning and with negligible loss in translation quality, making it feasible to run the model on a single 32GB GPU. Further analysis suggests that our pruning metrics can identify language-specific experts.
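
As a rough illustration of the numbers above, the following sketch estimates the weight memory of a 54.5B-parameter model and shows the general shape of score-based expert pruning. It rests on assumptions not stated in the abstract: weights are stored in 16-bit precision (2 bytes per parameter), activations and other buffers are ignored, and 1 GB is taken as 10^9 bytes. The function names and the per-expert importance scores are hypothetical placeholders; the abstract does not specify the actual pruning metric.

```python
import math


def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory needed for the weights alone, in GB (10^9 bytes).

    Assumes 16-bit weights by default; activations are not counted.
    """
    return num_params * bytes_per_param / 1e9


def min_gpus(num_params, gpu_memory_gb=32, bytes_per_param=2):
    """Smallest number of GPUs whose combined memory holds the weights."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / gpu_memory_gb)


def prune_experts(scores, keep_fraction=0.2):
    """Keep the highest-scoring experts and drop the rest.

    `scores` maps a (hypothetical) expert id to an importance score.
    Keeping 20% of experts corresponds to removing 80% of them.
    Returns the kept expert ids in ascending order.
    """
    n_keep = max(1, math.ceil(len(scores) * keep_fraction))
    ranked = sorted(scores, key=lambda e: scores[e], reverse=True)
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    print(weight_memory_gb(54.5e9))  # weight memory under the fp16 assumption
    print(min_gpus(54.5e9))          # GPUs needed for weights alone
    # Toy importance scores for five experts (illustrative values only).
    toy_scores = {0: 0.9, 1: 0.1, 2: 0.5, 3: 0.05, 4: 0.3}
    print(prune_experts(toy_scores, keep_fraction=0.4))
```

Under these assumptions the weights alone occupy about 109 GB, which exceeds three 32GB GPUs and is consistent with the stated minimum of four. How far the footprint shrinks after pruning depends on what share of the parameters sits in the experts, which the abstract does not report.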