Machine translation (MT) and automatic MT evaluation have improved dramatically in recent years, enabling numerous novel applications. Automatic evaluation techniques have evolved from producing scalar quality scores to precisely locating translation errors and assigning them error categories and severity levels. However, it remains unclear how to reliably measure the evaluation capabilities of auto-evaluators that perform error detection, as no established technique exists in the literature. This work investigates different implementations of span-level precision, recall, and F-score, showing that seemingly similar approaches can yield substantially different rankings, and that certain widely used techniques are unsuitable for evaluating MT error detection. We propose "match with partial overlap and partial credit" (MPP) with micro-averaging as a robust meta-evaluation strategy and publicly release code for its use. Finally, we use MPP to assess the state of the art in MT error detection.
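To make the idea of span-level precision, recall, and F-score with partial credit and micro-averaging concrete, the following is a minimal sketch, not the released implementation. It assumes that error spans are half-open character intervals, that partial credit is counted as the number of overlapping characters, and that an undefined ratio is scored as 0.0. The names `Span`, `covered`, and `micro_prf` are hypothetical and chosen only for illustration; the exact matching rule of MPP may differ.

```python
from typing import List, Set, Tuple

# Assumption: an error span is a half-open character interval [start, end).
Span = Tuple[int, int]


def covered(spans: List[Span]) -> Set[int]:
    """Return the set of character positions covered by any span."""
    return {i for start, end in spans for i in range(start, end)}


def micro_prf(
    gold: List[List[Span]], pred: List[List[Span]]
) -> Tuple[float, float, float]:
    """Character-overlap precision, recall, and F1, micro-averaged.

    Counts are summed over all segments before dividing (micro-averaging),
    so long segments with many errors weigh more than short ones.
    """
    overlap = n_pred = n_gold = 0
    for gold_spans, pred_spans in zip(gold, pred):
        g, p = covered(gold_spans), covered(pred_spans)
        overlap += len(g & p)  # partial credit: shared characters only
        n_pred += len(p)
        n_gold += len(g)
    # Assumption: an undefined ratio (zero denominator) is scored as 0.0.
    precision = overlap / n_pred if n_pred else 0.0
    recall = overlap / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Two segments: a partially overlapping prediction, then a missed error.
    gold = [[(0, 4)], [(10, 20)]]
    pred = [[(2, 6)], []]
    print(micro_prf(gold, pred))
```

In this toy input, 2 of the 4 predicted characters fall inside a gold span (precision 0.5), and 2 of the 14 gold characters are recovered (recall 1/7). Averaging per segment instead (macro-averaging) would weight the two segments equally, which is one way seemingly similar implementations can produce different rankings.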