Correction of Errors in Preference Ratings from Automated Metrics for Text Generation

A major challenge in the field of Text Generation is evaluation: Human evaluations are cost-intensive, and automated metrics often display considerable disagreement with human judgments. In this paper, we propose a statistical model of Text Generation evaluation that accounts for the error-proneness of automated metrics when used to generate preference rankings between system outputs. We show that existing automated metrics are generally over-confident in assigning significant differences between systems in this setting. However, our model enables an efficient combination of human and automated ratings to remedy the error-proneness of the automated metrics. We show that using this combination, we only require about 50% of the human annotations typically used in evaluations to arrive at robust and statistically significant results while yielding the same evaluation outcome as the pure human evaluation in 95% of cases. We showcase the benefits of approach for three text generation tasks: dialogue systems, machine translation, and text summarization.

翻译：文本生成领域的核心挑战之一在于评估：人工评估成本高昂，而自动评估指标往往与人工判断存在显著差异。本文提出一种面向文本生成评估的统计模型，该模型能够解释自动指标在生成系统输出偏好排序时存在的易错特性。研究表明，现有自动指标在此场景下通常过度自信地判定系统间存在显著差异。然而，我们的模型通过有效融合人工与自动评分，可弥补自动指标易出错的缺陷。实验证明，采用该融合方法仅需传统评估中约50%的人工标注量即可获得稳健且统计显著的结果，同时在95%的情况下能复现纯人工评估的结论。我们通过对话系统、机器翻译和文本摘要三项文本生成任务验证了该方法的优势。

相关内容

Automator

关注 5

Automator是苹果公司为他们的Mac OS X系统开发的一款软件。 只要通过点击拖拽鼠标等操作就可以将一系列动作组合成一个工作流，从而帮助你自动的（可重复的）完成一些复杂的工作。Automator还能横跨很多不同种类的程序，包括：查找器、Safari网络浏览器、iCal、地址簿或者其他的一些程序。它还能和一些第三方的程序一起工作，如微软的Office、Adobe公司的Photoshop或者Pixelmator等。

自然语言处理顶会NAACL2022最佳论文出炉！

专知会员服务

43+阅读 · 2022年6月30日

史上最全！358篇机器学习&自然语言处理综述论文！都这儿了

专知会员服务

129+阅读 · 2020年7月18日

【深度学习表格检测、信息提取和结构化】《Table Detection, Information Extraction and Structuring using Deep Learning》by Vihar Kurama

专知会员服务

38+阅读 · 2020年1月23日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日