This paper proposes an efficient, semi-automated method for human-in-the-loop post-editing in machine translation (MT) corpus generation. The method trains a custom MT quality estimation metric online, as linguists perform post-edits. The online estimator is used to prioritize the worst hypotheses for post-editing and to auto-close the best hypotheses without post-editing. In this way, the quality of the resulting post-edits can be improved significantly at a lower cost, owing to reduced human involvement. The trained estimator can also serve as an online sanity check for post-edits, removing the need for additional linguists to review them or to work on the same hypotheses. The paper presents the effect of prioritizing hypotheses with the proposed method on the quality of the resulting MT corpus, compared with scheduling hypotheses randomly. As the experiments demonstrate, the proposed method improves the lifecycle of MT models by focusing linguist effort on the production samples and hypotheses that matter most for expanding the MT corpora used to re-train those models.
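The prioritize, post-edit, update, and auto-close loop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the estimator or scheduler used in the paper: the toy features in `features`, the linear model in `OnlineEstimator`, the use of string similarity between a hypothesis and its post-edit as the quality label, and the hypothetical parameters `budget`, `auto_close_at`, and `warmup`.

```python
from difflib import SequenceMatcher


def features(source, hypothesis):
    """Toy features (assumed): bias, length ratio, shared-token ratio."""
    src, hyp = source.split(), hypothesis.split()
    length_ratio = min(len(src), len(hyp)) / max(len(src), len(hyp), 1)
    overlap = len(set(src) & set(hyp)) / max(len(set(src)), 1)
    return [1.0, length_ratio, overlap]


class OnlineEstimator:
    """Linear quality estimator updated with one SGD step per post-edit."""

    def __init__(self, n_features=3, lr=0.1):
        self.w = [0.0] * n_features
        self.lr = lr

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, target):
        # Gradient step on the squared error between prediction and target.
        err = self.predict(x) - target
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]


def post_edit_session(pairs, post_edit, budget, auto_close_at=0.95, warmup=2):
    """Schedule (source, hypothesis) pairs for post-editing, worst first.

    `post_edit` stands in for the linguist: it takes (source, hypothesis)
    and returns the corrected translation.
    """
    est = OnlineEstimator()
    pending = list(pairs)
    edited, auto_closed = [], []
    while pending and len(edited) < budget:
        # Lowest predicted quality goes to the linguist first.
        pending.sort(key=lambda p: est.predict(features(*p)))
        src, hyp = pending.pop(0)
        fixed = post_edit(src, hyp)
        # Assumed quality label: similarity of hypothesis to its post-edit.
        quality = SequenceMatcher(None, hyp, fixed).ratio()
        est.update(features(src, hyp), quality)
        edited.append((src, fixed))
        # After a short warm-up, close hypotheses predicted to be good enough.
        if len(edited) >= warmup:
            still_pending = []
            for pair in pending:
                if est.predict(features(*pair)) >= auto_close_at:
                    auto_closed.append(pair)
                else:
                    still_pending.append(pair)
            pending = still_pending
    return edited, auto_closed, pending
```

In this sketch the estimator is re-queried after every post-edit, so the ordering of the remaining hypotheses reflects all feedback collected so far. A random-scheduling baseline of the kind the paper compares against would replace the sort with a shuffle.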