We propose MuLER, a novel methodology that transforms any reference-based evaluation metric for text generation, such as machine translation (MT), into a fine-grained analysis tool. Given a system and a metric, MuLER quantifies how much the chosen metric penalizes specific error types (e.g., errors in translating names of locations). MuLER thus enables a detailed error analysis that can guide targeted improvement efforts for specific phenomena. We perform experiments in both synthetic and naturalistic settings to support MuLER's validity and showcase its usability in MT evaluation and in other tasks, such as summarization. Analyzing all submissions to WMT in 2014-2020, we find consistent trends. For example, nouns and verbs are among the most frequent POS tags, yet they are also among the hardest to translate. Performance on most POS tags improves with overall system performance, but a few are not correlated in this way (which tags these are varies from language to language). Preliminary experiments with summarization reveal similar trends.
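To make the idea of measuring how much a metric penalizes one error type more concrete, the following is a minimal sketch, not MuLER's actual formulation. It assumes tokens are already POS-tagged, uses a toy unigram F1 as a stand-in for the reference-based metric, and estimates the penalty for a tag by replacing every token with that tag by a shared placeholder in both the hypothesis and the reference, then comparing the masked score with the original score. All names here (`unigram_f1`, `mask_feature`, `feature_penalty`, `MASK`) are hypothetical.

```python
from collections import Counter

MASK = "<MASK>"  # hypothetical placeholder token


def unigram_f1(hyp, ref):
    """Toy reference-based metric (an assumption): unigram F1 over token lists."""
    if not hyp or not ref:
        return 0.0
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mask_feature(tagged, feature):
    """Replace every token carrying the given tag with a shared placeholder."""
    return [MASK if tag == feature else tok for tok, tag in tagged]


def feature_penalty(hyp_tagged, ref_tagged, feature, metric=unigram_f1):
    """Score gained when errors on `feature` tokens are hidden by masking.

    A larger value suggests the metric penalizes the system more heavily
    for errors on that feature.
    """
    plain = metric([tok for tok, _ in hyp_tagged], [tok for tok, _ in ref_tagged])
    masked = metric(mask_feature(hyp_tagged, feature), mask_feature(ref_tagged, feature))
    return masked - plain


# Illustrative, made-up example: the system mistranslates a location name.
ref = [("she", "PRON"), ("visited", "VERB"), ("Paris", "PROPN")]
hyp = [("she", "PRON"), ("visited", "VERB"), ("London", "PROPN")]

print(feature_penalty(hyp, ref, "PROPN"))  # positive: the proper-noun error costs score
print(feature_penalty(hyp, ref, "VERB"))   # zero: the verb was translated correctly
```

In this sketch the metric is passed in as a function, so any reference-based scorer with the same two-argument shape could be substituted for the toy one.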