In this work, we present AfriNLLB, a series of lightweight models for efficient translation from and into African languages. AfriNLLB supports 15 language pairs (30 translation directions), including Swahili, Hausa, Yoruba, Amharic, Somali, Zulu, Lingala, Afrikaans, Wolof, and Egyptian Arabic, as well as other African Union official languages such as Arabic (MSA), French, Portuguese, and Spanish. Our training data covers bidirectional translation between English and 13 languages, and between French and two languages (Lingala and Wolof). AfriNLLB models are based on NLLB-200 600M, which we compress using iterative layer pruning and quantization. We fine-tune the pruned models on parallel corpora we curated for African languages, employing knowledge distillation from a larger teacher model. Our work aims to enable efficient deployment of translation models for African languages in resource-constrained settings. Our evaluation shows that AfriNLLB models achieve translation quality comparable to the baseline while being significantly faster. We release two versions of the AfriNLLB models: a Transformers version that allows further fine-tuning, and a CTranslate2 version for efficient inference. Moreover, we release all the training data used to fine-tune the baseline and pruned models to facilitate further research.
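To make the compression step concrete, below is a minimal sketch of one layer-pruning step on the public NLLB-200 600M checkpoint using Hugging Face Transformers. The kept-layer indices are hypothetical and not the paper's actual pruning schedule; in the iterative setting described above, one would prune a few layers, fine-tune with distillation, and repeat.

```python
import torch.nn as nn
from transformers import AutoModelForSeq2SeqLM

# Load the public NLLB-200 600M checkpoint (the base model used in this work).
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

def prune_layers(layers: nn.ModuleList, keep: list) -> nn.ModuleList:
    """Keep only the Transformer layers at the given indices."""
    return nn.ModuleList(layers[i] for i in keep)

# Hypothetical schedule: keep every other layer of the 12-layer encoder/decoder.
keep_enc = [0, 2, 4, 6, 8, 10]
keep_dec = [0, 2, 4, 6, 8, 10]
model.model.encoder.layers = prune_layers(model.model.encoder.layers, keep_enc)
model.model.decoder.layers = prune_layers(model.model.decoder.layers, keep_dec)
model.config.encoder_layers = len(keep_enc)
model.config.decoder_layers = len(keep_dec)

# Save the pruned model; it would then be fine-tuned with knowledge distillation.
model.save_pretrained("afrinllb-pruned")
```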
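For the CTranslate2 release, inference would follow the standard CTranslate2 pattern for NLLB-style models, sketched below. The model path `afrinllb-ct2` is a placeholder, and the stock NLLB tokenizer is assumed; the language codes follow the FLORES-200 convention (here English to Swahili).

```python
import ctranslate2
import transformers

# Placeholder path to the converted AfriNLLB CTranslate2 model.
translator = ctranslate2.Translator("afrinllb-ct2", device="cpu")
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="eng_Latn"
)

source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Good morning, everyone."))
# NLLB decoders expect the target language token as the first generated token.
results = translator.translate_batch([source], target_prefix=[["swh_Latn"]])
target_tokens = results[0].hypotheses[0][1:]  # strip the language token
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target_tokens)))
```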