PEACH: Pre-Training Sequence-to-Sequence Multilingual Models for Translation with Semi-Supervised Pseudo-Parallel Document Generation

Multilingual pre-training significantly improves many multilingual NLP tasks, including machine translation. Most existing methods are based on variants of masked language modeling and text-denoising objectives on monolingual data. Pre-training on monolingual data alone ignores the parallel data available for many language pairs. Other works integrate the available human-generated parallel translation data into their pre-training. This kind of parallel data is helpful, but it is limited even in high-resource language pairs. This paper introduces a novel semi-supervised method, SPDG, that generates high-quality pseudo-parallel data for multilingual pre-training. First, a denoising model is pre-trained on monolingual data to reorder, add, remove, and substitute words, enhancing the pre-training documents' quality. Then, we generate different pseudo-translations for each pre-training document by translating it word by word with dictionaries and applying the pre-trained denoising model. The resulting pseudo-parallel data is then used to pre-train our multilingual sequence-to-sequence model, PEACH. Our experiments show that PEACH outperforms the pre-training approaches used for mT5 and mBART on various translation tasks, including supervised, zero-shot, and few-shot scenarios. Moreover, PEACH's ability to transfer knowledge between similar languages makes it particularly useful for low-resource languages. Our results demonstrate that, given high-quality dictionaries for generating accurate pseudo-parallel data, PEACH can be valuable for low-resource languages.
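
The two generation steps described above (word-by-word dictionary translation followed by denoising) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy English-to-French dictionary `TOY_DICT`, the function names, and the whitespace tokenization are all assumptions made for illustration, and the pre-trained denoising model is replaced by a placeholder that returns its input unchanged.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical toy dictionary (assumption): each source word maps to one or
# more candidate translations. A real setup would use full bilingual dictionaries.
TOY_DICT: Dict[str, List[str]] = {
    "the": ["le", "la"],
    "cat": ["chat"],
    "sleeps": ["dort"],
}


def word_by_word(doc: str, dictionary: Dict[str, List[str]], rng: random.Random) -> str:
    """Translate a document token by token; unknown tokens are copied as-is."""
    out = []
    for tok in doc.split():  # naive whitespace tokenization (assumption)
        candidates = dictionary.get(tok.lower())
        out.append(rng.choice(candidates) if candidates else tok)
    return " ".join(out)


def identity_denoiser(text: str) -> str:
    """Placeholder for the pre-trained denoising model, which would reorder,
    add, remove, and substitute words. Here it returns the text unchanged."""
    return text


def make_pseudo_parallel(
    docs: List[str],
    dictionary: Dict[str, List[str]],
    denoise: Callable[[str], str] = identity_denoiser,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Pair each document with a denoised word-by-word pseudo-translation."""
    rng = random.Random(seed)
    return [(doc, denoise(word_by_word(doc, dictionary, rng))) for doc in docs]


if __name__ == "__main__":
    for src, pseudo in make_pseudo_parallel(["the cat sleeps", "a dog"], TOY_DICT):
        print(src, "->", pseudo)
```

Sampling among several dictionary candidates is one simple way to obtain different pseudo-translations of the same document. Whether SPDG varies its outputs this way is not stated here, so treat that choice as an assumption as well.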
