Is it possible to train a general metric for evaluating text generation quality without human annotated ratings? Existing learned metrics either perform unsatisfactorily across text generation tasks or require human ratings for training on specific tasks. In this paper, we propose SESCORE2, a self-supervised approach for training a model-based metric for text generation evaluation. The key concept is to synthesize realistic model mistakes by perturbing sentences retrieved from a corpus. The primary advantage of the SESCORE2 is its ease of extension to many other languages while providing reliable severity estimation. We evaluate SESCORE2 and previous methods on four text generation tasks across three languages. SESCORE2 outperforms unsupervised metric PRISM on four text generation evaluation benchmarks, with a Kendall improvement of 0.078. Surprisingly, SESCORE2 even outperforms the supervised BLEURT and COMET on multiple text generation tasks. The code and data are available at https://github.com/xu1998hz/SEScore2.
翻译:能否在不依赖人工标注评分的情况下,训练出通用的文本生成质量评估指标?现有学习型指标要么在各类文本生成任务中表现不理想,要么需要针对特定任务的人工评分进行训练。本文提出SESCORE2——一种基于自监督学习的模型化文本生成评估指标训练方法。其核心思想是通过扰动语料库中检索到的句子,合成具有真实性的模型错误。SESCORE2的主要优势在于易于扩展至多种语言,同时提供可靠的错误严重程度评估。我们在三种语言的四项文本生成任务上对SESCORE2及既有方法进行了评估。SESCORE2在四个文本生成评估基准中全面超越无监督指标PRISM,Kendall相关系数提升0.078。令人惊讶的是,SESCORE2在多项文本生成任务上甚至优于有监督指标BLEURT和COMET。代码与数据已开源至https://github.com/xu1998hz/SEScore2。