High-quality Machine Translation (MT) evaluation relies heavily on human judgments. Comprehensive error classification methods, such as Multidimensional Quality Metrics (MQM), are expensive because they are time-consuming and can only be performed by experts, whose availability may be limited, especially for low-resource languages. On the other hand, assigning overall scores, as in Direct Assessment (DA), is simpler and faster and can be done by translators of any level, but is less reliable. In this paper, we introduce Error Span Annotation (ESA), a human evaluation protocol which combines the continuous rating of DA with the high-level error severity span marking of MQM. We validate ESA by comparing it to MQM and DA on 12 MT systems and one human reference translation (English to German) from WMT23. The results show that ESA offers faster and cheaper annotations than MQM at the same quality level, without requiring expensive MQM experts.