Automatic Post-Editing (APE) is the task of automatically identifying and correcting errors in Machine Translation (MT) output. We propose a repair-filter-use methodology in which an APE system corrects errors on the target side of the MT training data. For each source sentence, we then choose between the original and the corrected sentence pair based on quality scores computed by a Quality Estimation (QE) model. To the best of our knowledge, this is a novel adaptation of APE and QE for extracting a high-quality parallel corpus from a pseudo-parallel corpus. Training on the filtered corpus improves the MT system's performance by 5.64 and 9.91 BLEU points for English-Marathi and Marathi-English, respectively, over a baseline model trained on the whole pseudo-parallel corpus. Our approach does not depend on any characteristics of English or Marathi and is language-pair agnostic, provided the necessary QE and APE data are available.
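The repair and filter steps can be illustrated with a minimal sketch. The abstract does not state the exact selection rule, so the sketch assumes one plausible reading: score both the original and the APE-corrected target with the QE model, keep the higher-scoring one, and discard the pair if even the better score falls below a threshold. The names `repair_filter`, `ape`, `qe`, and `threshold` are hypothetical placeholders, not the authors' implementation.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def repair_filter(
    pairs: Iterable[Pair],
    ape: Callable[[str, str], str],   # hypothetical: (source, target) -> corrected target
    qe: Callable[[str, str], float],  # hypothetical: (source, target) -> quality score
    threshold: float,                 # assumed cut-off; not given in the abstract
) -> List[Pair]:
    """Repair each target with APE, then keep the better-scoring variant."""
    kept: List[Pair] = []
    for src, tgt in pairs:
        # Repair: let the APE system propose a corrected target.
        repaired = ape(src, tgt)
        # Filter: score the original and the repaired target with QE.
        candidates = [(qe(src, tgt), tgt), (qe(src, repaired), repaired)]
        # On a tie, max() returns the first candidate, i.e. the original.
        best_score, best_tgt = max(candidates, key=lambda c: c[0])
        if best_score >= threshold:
            kept.append((src, best_tgt))
    return kept
```

The returned list plays the role of the filtered corpus in the "use" step, where it replaces the full pseudo-parallel corpus as MT training data. Passing the APE and QE systems in as callables keeps the sketch independent of any particular model or language pair.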