We address a notable gap in Natural Language Processing (NLP) by introducing a collection of resources designed to improve Machine Translation (MT) for low-resource languages, with a specific focus on African languages. First, we introduce two language models (LMs), Cheetah-1.2B and Cheetah-3.7B, with 1.2 billion and 3.7 billion parameters, respectively. Next, we finetune these models to create Toucan, an Afrocentric machine translation model designed to support 156 African language pairs. To evaluate Toucan, we carefully develop an extensive machine translation benchmark, dubbed AfroLingu-MT. Toucan significantly outperforms other models, demonstrating strong MT performance for African languages. Finally, we train a new model, spBLEU-1K, to enhance translation evaluation metrics, covering 1,000 languages, including 614 African languages. This work aims to advance the field of NLP, fostering cross-cultural understanding and knowledge exchange, particularly in regions with limited language resources such as Africa. The GitHub repository for the Toucan project is available at https://github.com/UBC-NLP/Toucan.