We address a notable gap in Natural Language Processing (NLP) by introducing a collection of resources designed to improve Machine Translation (MT) for low-resource languages, with a specific focus on African languages. First, we introduce two language models (LMs), Cheetah-1.2B and Cheetah-3.7B, with 1.2 billion and 3.7 billion parameters, respectively. Next, we finetune these models to create Toucan, an Afrocentric machine translation model designed to support 156 African language pairs. To evaluate Toucan, we carefully develop an extensive machine translation benchmark, dubbed AfroLingu-MT. Toucan significantly outperforms other models, demonstrating strong MT performance for African languages. Finally, we train a new model, spBLEU-1K, to enhance translation evaluation metrics, covering 1,000 languages, including 614 African languages. This work aims to advance the field of NLP, fostering cross-cultural understanding and knowledge exchange, particularly in regions with limited language resources such as Africa. The GitHub repository for the Toucan project is available at https://github.com/UBC-NLP/Toucan.