Web-mined parallel corpora contain large amounts of noise. Semantic misalignment, as the primary source of this noise, poses a challenge for training machine translation systems. In this paper, we first study the impact of real-world, hard-to-detect misalignment noise by proposing a process to simulate realistic misalignment controlled by semantic similarity. After quantitatively analyzing the impact of simulated misalignment on machine translation, we show that widely used pre-filters are of limited effectiveness in improving translation performance, underscoring the necessity of more fine-grained ways to handle data noise. Observing that the model's self-knowledge becomes increasingly reliable at distinguishing misaligned from clean data at the token level, we propose a self-correction approach that leverages the model's prediction distribution to revise the training supervision from the ground-truth data over the course of training. Through comprehensive experiments, we show that our self-correction method not only improves translation performance in the presence of simulated misalignment noise but also proves effective on real-world noisy web-mined datasets across eight translation tasks.
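The core idea of revising the training supervision can be sketched as interpolating the one-hot ground-truth targets with the model's own (detached) prediction distribution, trusting the model more as training progresses. This is a minimal illustrative sketch, not the paper's exact formulation: the interpolation weight `alpha`, its linear schedule, and the cap `max_alpha` are all assumptions introduced here for illustration.

```python
import torch
import torch.nn.functional as F

def self_corrected_targets(logits, gold_ids, alpha):
    """Blend one-hot ground-truth labels with the model's own prediction
    distribution (hypothetical interpolation scheme, not the paper's exact one)."""
    vocab_size = logits.size(-1)
    one_hot = F.one_hot(gold_ids, vocab_size).float()
    # Detach so the "self-knowledge" acts as a fixed target, not a gradient path.
    model_probs = logits.softmax(dim=-1).detach()
    return (1.0 - alpha) * one_hot + alpha * model_probs

def self_correction_loss(logits, gold_ids, step, total_steps, max_alpha=0.5):
    """Cross-entropy against revised targets; alpha grows linearly so the
    model's predictions are trusted more later in training (assumed schedule)."""
    alpha = max_alpha * step / total_steps
    targets = self_corrected_targets(logits, gold_ids, alpha)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()
```

At `step=0` the interpolation weight is zero and the loss reduces to standard token-level cross-entropy, so the method degrades gracefully to ordinary training before the model's self-knowledge is reliable.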