Pretraining Strategies using Monolingual and Parallel Data for Low-Resource Machine Translation

This article examines the effectiveness of several pretraining strategies for building machine translation models for low-resource languages. Although the study considers several low-resource languages, including Afrikaans, Swahili, and Zulu, the translation model itself is developed for Lingala, an under-resourced African language. It builds on the pretraining approach introduced by Reid and Artetxe (2021), which was originally designed for high-resource languages. Through a series of experiments, we explore different pretraining setups, including pretraining on multiple languages and using both monolingual and parallel data during the pretraining phase. Our findings indicate that pretraining on multiple languages and using both monolingual and parallel data significantly improve translation quality. The study offers insight into effective pretraining strategies for low-resource machine translation and helps narrow the performance gap between high-resource and low-resource languages. The results contribute to the broader goal of developing more inclusive and accurate NLP models for marginalized communities and underrepresented populations. The code and datasets used in this study are publicly available to support further research and reproducibility, except for some data that may no longer be publicly accessible.

