Back translation (BT) is one of the most important techniques in neural machine translation (NMT) research. Existing work on BT shares a common characteristic: it uses either beam search or random sampling to generate synthetic data with a backward model, but few studies examine the role that synthetic data plays in BT performance. This motivates us to ask a fundamental question: {\em what kind of synthetic data contributes to BT performance?} Through both theoretical and empirical studies, we identify two key properties of synthetic data that control BT performance in NMT: quality and importance. Based on these findings, we propose a simple yet effective method for generating synthetic data that better trades off the two factors and thereby improves BT performance. We run extensive experiments on the WMT14 DE-EN, EN-DE, and RU-EN benchmark tasks. With synthetic data generated by our proposed method, our BT model significantly outperforms the standard BT baselines (i.e., beam-search-based and sampling-based data generation), demonstrating the effectiveness of the proposed method.