Semi-supervised learning that leverages synthetic training data has been widely adopted for developing automatic post-editing (APE) models because genuine training data are scarce. We therefore focus on data-synthesis methods that produce high-quality synthetic data. Because APE takes as input a machine-translation output that may contain errors, we present a data-synthesis method whose synthetic data mimic the translation errors found in actual data. Specifically, we introduce a noising-based data-synthesis method that adapts the masked language model approach: it generates a noisy text from a clean text by infilling masked tokens with erroneous tokens. We further propose selective corpus interleaving, which combines two separate synthetic datasets by taking only the advantageous samples, to further enhance the quality of the synthetic data. Experimental results show that training on the synthetic data created by our approach yields significantly better APE performance than training on synthetic data created by existing methods.
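The two ideas named in the abstract, mask-infilling with erroneous tokens and selective corpus interleaving, can be illustrated with a toy sketch. This is not the authors' implementation. The masked language model is replaced by a hypothetical `propose` callback that returns candidate tokens for a masked slot, and the notion of an "advantageous" sample is replaced by an assumed scoring rule (closeness of the token mismatch rate to a chosen target). All function names, the `<mask>` symbol, and the parameters below are illustrative assumptions.

```python
import random
from typing import Callable, List, Sequence, Tuple

MASK = "<mask>"  # hypothetical mask symbol, chosen for this sketch only
Sample = Tuple[List[str], List[str]]  # (noisy tokens, clean tokens)


def mask_tokens(tokens: Sequence[str], mask_prob: float, rng: random.Random) -> List[str]:
    """Replace each token with MASK independently with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


def infill_with_errors(
    masked: Sequence[str],
    original: Sequence[str],
    propose: Callable[[str], Sequence[str]],
    rng: random.Random,
) -> List[str]:
    """Fill each masked slot with a proposed token that differs from the original.

    `propose` stands in for a masked language model (an assumption of this
    sketch). If it offers no differing candidate, the original token is kept.
    """
    noisy = []
    for slot, orig in zip(masked, original):
        if slot != MASK:
            noisy.append(slot)
            continue
        candidates = [c for c in propose(orig) if c != orig]
        noisy.append(rng.choice(candidates) if candidates else orig)
    return noisy


def mismatch_rate(sample: Sample) -> float:
    """Fraction of positions where the noisy and clean tokens differ."""
    noisy, clean = sample
    return sum(n != c for n, c in zip(noisy, clean)) / max(len(clean), 1)


def interleave_selectively(
    corpus_a: Sequence[Sample],
    corpus_b: Sequence[Sample],
    score: Callable[[Sample], float],
) -> List[Sample]:
    """For each aligned pair, keep the higher-scoring sample (ties keep corpus_a)."""
    if len(corpus_a) != len(corpus_b):
        raise ValueError("corpora must be aligned sample by sample")
    return [a if score(a) >= score(b) else b for a, b in zip(corpus_a, corpus_b)]


if __name__ == "__main__":
    rng = random.Random(0)
    clean = "the cat sat on the mat".split()
    # Hypothetical confusion table standing in for masked-LM predictions.
    confusions = {"cat": ["dog", "car"], "sat": ["sits", "set"], "on": ["in", "at"]}
    masked = mask_tokens(clean, mask_prob=0.3, rng=rng)
    noisy = infill_with_errors(masked, clean, lambda t: confusions.get(t, []), rng)
    print(masked)
    print(noisy)
```

In this sketch the score passed to `interleave_selectively` could be, for example, `lambda s: -abs(mismatch_rate(s) - 0.25)`, which prefers samples whose error rate is close to an assumed target of 25%. A realistic setup would replace `propose` with predictions from an actual masked language model and the score with a criterion derived from real APE data.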