The correlation between NLG automatic evaluation metrics and human evaluation is often regarded as a critical criterion for assessing the capability of an evaluation metric. However, different grouping methods and correlation coefficients yield various correlation measures for meta-evaluation. In specific evaluation scenarios, prior work often directly follows conventional measure settings, but the characteristics of and differences between these measures have not received sufficient attention. Therefore, this paper analyzes 12 common correlation measures using a large amount of real-world data from six widely used NLG evaluation datasets and 32 evaluation metrics, revealing that the choice of measure indeed affects meta-evaluation results. Furthermore, we propose three perspectives that reflect the capability of a meta-evaluation measure: discriminative power, ranking consistency, and sensitivity to score granularity. Across these perspectives, the measure using global grouping and Pearson correlation exhibits the best overall performance.
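To make the distinction between grouping methods concrete, the following is a minimal sketch (not the paper's code, and with purely illustrative scores) contrasting two common meta-evaluation measures: global grouping, which pools all (system, sample) score pairs before correlating, and sample-level grouping, which correlates across systems within each input sample and averages the per-sample coefficients. Both use Pearson correlation here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores: rows = generation systems, columns = input samples.
human  = [[0.9, 0.8, 0.7, 0.6],
          [0.5, 0.6, 0.4, 0.5],
          [0.2, 0.3, 0.1, 0.2]]
metric = [[0.8, 0.9, 0.6, 0.7],
          [0.6, 0.5, 0.5, 0.4],
          [0.3, 0.2, 0.2, 0.1]]

# Global grouping: pool every (system, sample) pair, correlate once.
flat_h = [s for row in human for s in row]
flat_m = [s for row in metric for s in row]
global_r = pearson(flat_h, flat_m)

# Sample-level grouping: correlate across systems within each sample,
# then average the per-sample coefficients.
n_systems, n_samples = len(human), len(human[0])
sample_r = sum(
    pearson([human[i][j] for i in range(n_systems)],
            [metric[i][j] for i in range(n_systems)])
    for j in range(n_samples)
) / n_samples

print(f"global Pearson: {global_r:.3f}, sample-level Pearson: {sample_r:.3f}")
```

The two numbers generally differ: global grouping rewards a metric for separating systems of very different quality, while sample-level grouping rewards fine-grained agreement within each input. Swapping `pearson` for a rank correlation (Spearman or Kendall) in either grouping yields further measure variants of the kind the paper compares.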