This paper introduces two multilingual, government-themed corpora in various South African languages. The corpora were collected by gathering the South African Government newspaper (Vuk'uzenzele), as well as South African government speeches (ZA-gov-multilingual), which are translated into all 11 official South African languages. The corpora can support a range of downstream NLP tasks. They were created to allow researchers to study the language used in South African government publications, with a focus on understanding how South African government officials communicate with their constituents. In this paper we describe the process of gathering, cleaning, and making available the corpora. We create parallel sentence corpora for Neural Machine Translation (NMT) tasks using Language-Agnostic Sentence Representations (LASER) embeddings. Using these aligned sentences, we then provide NMT benchmarks for 9 indigenous languages by fine-tuning a massively multilingual pre-trained language model.
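As an illustration of the embedding-based alignment step mentioned above, the following is a minimal sketch of pairing sentences by cosine similarity between their sentence vectors. It is not the authors' pipeline: the function names (`cosine`, `align_sentences`), the greedy best-match strategy, the 0.8 threshold, and the three-dimensional toy vectors are all assumptions made for illustration. In practice the vectors would come from a LASER encoder rather than being written by hand.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_sentences(src_vecs, tgt_vecs, threshold=0.8):
    """Greedily pair each source vector with its most similar target vector.

    Returns (source_index, target_index, score) triples, keeping only
    pairs whose similarity reaches the (hypothetical) threshold.
    """
    pairs = []
    for i, u in enumerate(src_vecs):
        best_j, best_score = -1, -1.0
        for j, v in enumerate(tgt_vecs):
            score = cosine(u, v)
            if score > best_score:
                best_j, best_score = j, score
        if best_j >= 0 and best_score >= threshold:
            pairs.append((i, best_j, best_score))
    return pairs


# Toy stand-ins for sentence embeddings (real LASER vectors are much larger).
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
tgt = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.0, 0.0, -1.0]]

aligned = align_sentences(src, tgt, threshold=0.8)
for i, j, score in aligned:
    print(f"source {i} <-> target {j} (similarity {score:.3f})")
```

In this toy run the third source vector has no sufficiently similar target and is dropped, which is how a threshold filters out sentences without a usable translation. This greedy version does not enforce one-to-one matches; a stricter aligner could add that constraint.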