Kendall's tau is frequently used to meta-evaluate how well machine translation (MT) evaluation metrics score individual translations. Its focus on pairwise score comparisons is intuitive but raises the question of how ties should be handled, a gray area that has motivated different variants in the literature. We demonstrate that, in settings like modern MT meta-evaluation, existing variants have weaknesses arising from their handling of ties, and in some situations can even be gamed. We propose a novel variant that gives metrics credit for correctly predicting ties, as well as an optimization procedure that automatically introduces ties into metric scores, enabling fair comparison between metrics that do and do not predict ties. We argue and provide experimental evidence that these modifications lead to fairer Kendall-based assessments of metric performance.
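To make the two ideas concrete, the following is a minimal sketch and not the authors' implementation. It assumes the tie-aware statistic can be read as a pairwise accuracy, in which a pair counts as correct when the metric and the human scores agree on its ordering or both call it a tie. It also assumes that ties are introduced by treating any two metric scores within a threshold `epsilon` of each other as tied. The function names `pairwise_accuracy` and `calibrate_epsilon`, the brute-force threshold search, and the toy scores are all hypothetical.

```python
from itertools import combinations


def pairwise_accuracy(human, metric, epsilon=0.0):
    """Fraction of item pairs on which the metric agrees with the human scores.

    Assumption: a metric pair is treated as tied when the absolute score
    difference is at most `epsilon`; human ties are exact equality.
    """
    correct = 0
    total = 0
    for i, j in combinations(range(len(human)), 2):
        # Sign of the human comparison: -1, 0 (tie), or +1.
        h = (human[i] > human[j]) - (human[i] < human[j])
        diff = metric[i] - metric[j]
        # Sign of the metric comparison, with near-equal scores collapsed to a tie.
        m = 0 if abs(diff) <= epsilon else (1 if diff > 0 else -1)
        correct += (h == m)
        total += 1
    return correct / total if total else 0.0


def calibrate_epsilon(human, metric):
    """Brute-force search for the tie threshold that maximizes pairwise accuracy.

    Assumption: only 0 and the observed pairwise score differences need to be
    tried, since the accuracy can only change at those values.
    """
    candidates = {0.0} | {abs(a - b) for a, b in combinations(metric, 2)}
    # sorted() makes the choice deterministic: the smallest best threshold wins.
    return max(sorted(candidates), key=lambda e: pairwise_accuracy(human, metric, e))


# Toy data (made up for illustration): the first two items are tied for the
# human raters, while the metric separates them by a small margin.
human_scores = [1, 1, 2, 3]
metric_scores = [0.50, 0.52, 0.70, 0.90]

eps = calibrate_epsilon(human_scores, metric_scores)
print(pairwise_accuracy(human_scores, metric_scores))       # no ties predicted
print(pairwise_accuracy(human_scores, metric_scores, eps))  # ties introduced
```

In this toy case a continuous metric never predicts the human tie, so it is penalized on that pair until a threshold is applied. Selecting the threshold on the same data that is then scored, as done here for brevity, is a simplification.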