Automatic n-gram-based metrics such as ROUGE are widely used for evaluating generative tasks such as summarization. While these metrics are considered indicative (even if imperfect) of human evaluation for English, their suitability for other languages remains unclear. To address this, we systematically assess both n-gram-based and neural evaluation metrics for generation, examining their effectiveness across languages and tasks. Specifically, we design a large-scale evaluation suite covering eight languages from four typological families (agglutinative, isolating, low-fusional, and high-fusional), spanning both low- and high-resource settings, and analyze the metrics' correlation with human judgments. Our findings highlight the sensitivity of evaluation metrics to language type. For example, in fusional languages, n-gram-based metrics show lower correlation with human assessments than in isolating and agglutinative languages. We also demonstrate that proper tokenization can significantly mitigate this issue for morphologically rich fusional languages, sometimes even reversing negative trends. Additionally, we show that neural metrics specifically trained for evaluation, such as COMET, consistently outperform other neural metrics and correlate better with human judgments in low-resource languages. Overall, our analysis highlights the limitations of n-gram metrics for fusional languages and advocates for greater investment in neural metrics trained for evaluation.
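To make the evaluation setup concrete, the following is a minimal sketch (not the paper's actual pipeline) of how an n-gram metric's agreement with human judgments can be measured under different tokenizations. The helpers `unigram_f1`, `whitespace_tokenize`, and `subword_tokenize` are illustrative stand-ins: `unigram_f1` is a ROUGE-1-style unigram F1, and `subword_tokenize` crudely mimics a morphology-aware segmenter (in practice one might use a trained SentencePiece model for fusional languages). The data below is toy data, not from the study.

```python
# Sketch: score outputs against references with a ROUGE-1-style unigram F1
# under two tokenization schemes, then measure Spearman correlation with
# human ratings (the kind of metric-human correlation analyzed in the paper).
from collections import Counter
from scipy.stats import spearmanr


def unigram_f1(reference: str, candidate: str, tokenize) -> float:
    """ROUGE-1-style F1 over unigrams produced by `tokenize`."""
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    overlap = sum((ref_counts & cand_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def whitespace_tokenize(text: str):
    # Naive word-level tokenization; penalizes inflected word forms heavily.
    return text.lower().split()


def subword_tokenize(text: str):
    # Placeholder for a morphology-aware tokenizer (e.g. SentencePiece);
    # here long words are simply split in half for illustration.
    pieces = []
    for word in text.lower().split():
        if len(word) > 6:
            pieces.extend([word[: len(word) // 2], word[len(word) // 2:]])
        else:
            pieces.append(word)
    return pieces


# Toy examples: (reference, system output, human rating on a 1-5 scale).
samples = [
    ("the cats were sleeping quietly", "the cat slept quietly", 4.5),
    ("a storm damaged several houses", "houses were damaged by a storm", 4.0),
    ("the report was published yesterday", "the weather is nice today", 1.0),
]

human = [h for _, _, h in samples]
for name, tok in [("whitespace", whitespace_tokenize), ("subword", subword_tokenize)]:
    scores = [unigram_f1(ref, hyp, tok) for ref, hyp, _ in samples]
    rho, _ = spearmanr(scores, human)
    print(f"{name:>10}: Spearman rho vs. human = {rho:.3f}")
```

The same correlation protocol applies unchanged when the metric scores come from a neural evaluator such as COMET instead of an n-gram metric; only the scoring function is swapped.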