Kendall's tau is frequently used to meta-evaluate how well machine translation (MT) evaluation metrics score individual translations. Its focus on pairwise score comparisons is intuitive but raises the question of how ties should be handled, a gray area that has motivated different variants in the literature. We demonstrate that, in settings like modern MT meta-evaluation, existing variants have weaknesses arising from their handling of ties, and in some situations can even be gamed. We propose instead to meta-evaluate metrics with a version of pairwise accuracy that gives metrics credit for correctly predicting ties, in combination with a tie calibration procedure that automatically introduces ties into metric scores, enabling fair comparison between metrics that do and do not predict ties. We argue and provide experimental evidence that these modifications lead to fairer ranking-based assessments of metric performance.
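
To make the two proposed components concrete, the following is a minimal sketch, not the authors' implementation. It assumes that pairwise accuracy is the fraction of translation pairs for which the metric's relation (better, worse, or tied) matches the human relation, and that tie calibration works by choosing a threshold below which a metric score difference is treated as a tie. The abstract does not specify the mechanism, so the thresholding approach and the names `pairwise_accuracy`, `calibrate_ties`, and `epsilon` are all hypothetical.

```python
from itertools import combinations


def _relation(diff, tol=0.0):
    """Return 0 for a tie (|diff| <= tol), otherwise the sign of diff."""
    if abs(diff) <= tol:
        return 0
    return 1 if diff > 0 else -1


def pairwise_accuracy(human, metric, epsilon=0.0):
    """Fraction of pairs where the metric's relation matches the human one.

    Assumption: a metric difference of at most `epsilon` counts as a
    predicted tie, so correctly predicted ties earn credit.
    """
    pairs = list(combinations(range(len(human)), 2))
    if not pairs:
        raise ValueError("need at least two scored translations")
    correct = sum(
        _relation(human[i] - human[j]) == _relation(metric[i] - metric[j], epsilon)
        for i, j in pairs
    )
    return correct / len(pairs)


def calibrate_ties(human, metric):
    """Pick the tie threshold that maximizes pairwise accuracy.

    Assumption: candidate thresholds are 0 plus every observed absolute
    difference between metric scores; the smallest best threshold wins.
    """
    candidates = {0.0} | {
        abs(metric[i] - metric[j])
        for i, j in combinations(range(len(metric)), 2)
    }
    best = max(sorted(candidates), key=lambda e: pairwise_accuracy(human, metric, e))
    return best, pairwise_accuracy(human, metric, best)


# Toy data: the first two translations are tied according to the human scores.
human_scores = [1, 1, 2, 3]
metric_scores = [0.50, 0.52, 0.70, 0.90]

uncalibrated = pairwise_accuracy(human_scores, metric_scores)
threshold, calibrated = calibrate_ties(human_scores, metric_scores)
print(uncalibrated, threshold, calibrated)
```

In this toy case the continuous metric never predicts a tie on its own, so it is penalized on the one pair the human scores tie; after calibration, the small difference between the first two metric scores is treated as a tie and that pair is counted as correct. Selecting the threshold on the same data used for evaluation is a simplification made here for brevity.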