Synthetic Pre-Training Tasks for Neural Machine Translation

Pre-training models with large crawled corpora can lead to issues such as toxicity and bias, as well as copyright and privacy concerns. A promising way of alleviating such concerns is to conduct pre-training with synthetic tasks and data, since no real-world information is ingested by the model. Our goal in this paper is to understand the factors that contribute to the effectiveness of pre-training when using synthetic resources, particularly in the context of neural machine translation (NMT). We propose several novel approaches to pre-training translation models that involve different levels of lexical and structural knowledge, including: 1) generating obfuscated data from a large parallel corpus, 2) concatenating phrase pairs extracted from a small word-aligned corpus, and 3) generating synthetic parallel data without real human language corpora. Our experiments on multiple language pairs reveal that pre-training benefits can be realized even with high levels of obfuscation or purely synthetic parallel data. We hope the findings from our comprehensive empirical analysis will shed light on what matters for NMT pre-training, as well as pave the way for the development of more efficient and less toxic models.
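As a rough illustration of the first approach, the sketch below shows one way obfuscated parallel data could be generated: a chosen share of each side's vocabulary is replaced, consistently, with opaque placeholder tokens, so sentence structure and word correspondences survive while lexical content is hidden. The abstract does not specify the procedure, so everything here is an assumption made for illustration: the function names (`build_obfuscation_map`, `obfuscate`, `obfuscate_corpus`), the `ratio` parameter, whitespace tokenization, and the `S`/`T` placeholder prefixes are all hypothetical.

```python
import random


def build_obfuscation_map(vocab, ratio, rng, prefix):
    """Map a random `ratio` share of `vocab` to opaque placeholder tokens.

    Hypothetical helper: placeholders look like "S0", "S1", ... and are
    assumed not to collide with real words in the corpus.
    """
    words = sorted(vocab)                      # sort for reproducibility
    n_obfuscated = round(len(words) * ratio)
    chosen = rng.sample(words, n_obfuscated)
    return {word: f"{prefix}{i}" for i, word in enumerate(chosen)}


def obfuscate(sentence, mapping):
    """Replace every mapped token; leave all other tokens unchanged."""
    return " ".join(mapping.get(tok, tok) for tok in sentence.split())


def obfuscate_corpus(pairs, ratio, seed=0):
    """Obfuscate both sides of a list of (source, target) sentence pairs.

    Each side gets its own consistent mapping, so a given word is always
    replaced by the same placeholder throughout the corpus.
    """
    rng = random.Random(seed)
    src_vocab = {tok for src, _ in pairs for tok in src.split()}
    tgt_vocab = {tok for _, tgt in pairs for tok in tgt.split()}
    src_map = build_obfuscation_map(src_vocab, ratio, rng, "S")
    tgt_map = build_obfuscation_map(tgt_vocab, ratio, rng, "T")
    return [(obfuscate(s, src_map), obfuscate(t, tgt_map)) for s, t in pairs]


if __name__ == "__main__":
    corpus = [
        ("the cat sleeps", "die Katze schläft"),
        ("the dog sleeps", "der Hund schläft"),
    ]
    for src, tgt in obfuscate_corpus(corpus, ratio=0.5):
        print(src, "|||", tgt)
```

Setting `ratio` to 1.0 in this sketch hides every word on both sides, which corresponds loosely to the "high levels of obfuscation" mentioned above; a value of 0.0 returns the corpus unchanged.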

