Pre-training models on large crawled corpora can lead to issues such as toxicity and bias, as well as copyright and privacy concerns. A promising way of alleviating such concerns is to conduct pre-training with synthetic tasks and data, since the model then ingests no real-world information. Our goal in this paper is to understand the factors that contribute to the effectiveness of pre-training with synthetic resources, particularly in the context of neural machine translation (NMT). We propose several novel approaches to pre-training translation models that involve different levels of lexical and structural knowledge: 1) generating obfuscated data from a large parallel corpus, 2) concatenating phrase pairs extracted from a small word-aligned corpus, and 3) generating synthetic parallel data without real human language corpora. Our experiments on multiple language pairs reveal that pre-training benefits can be realized even with high levels of obfuscation or purely synthetic parallel data. We hope the findings from our comprehensive empirical analysis will shed light on what matters for NMT pre-training, and pave the way for the development of more efficient and less toxic models.