This research article examines the effectiveness of various pretraining strategies for developing machine translation models for low-resource languages. Although the work considers several low-resource languages, including Afrikaans, Swahili, and Zulu, the translation model itself is developed for Lingala, an under-resourced African language. It builds upon the pretraining approach introduced by Reid and Artetxe (2021), which was originally designed for high-resource languages. Through a series of comprehensive experiments, we explore different pretraining methodologies, including the integration of multiple languages and the use of both monolingual and parallel data during pretraining. Our findings indicate that pretraining on multiple languages and leveraging both monolingual and parallel data significantly enhance translation quality. This study offers insights into effective pretraining strategies for low-resource machine translation, helping to bridge the performance gap between high-resource and low-resource languages. The results contribute to the broader goal of developing more inclusive and accurate NLP models for marginalized and underrepresented communities. The code and datasets used in this study are publicly available to facilitate further research and ensure reproducibility, with the exception of certain data that may no longer be accessible due to changes in public availability.