Training learnable metrics using modern language models has recently emerged as a promising method for the automatic evaluation of machine translation. However, existing human evaluation datasets for text simplification have limited annotations that are based on unitary or outdated models, making them unsuitable for this approach. To address these issues, we introduce the SimpEval corpus, which contains two parts: SimpEval_past, comprising 12K human ratings on 2.4K simplifications of 24 past systems, and SimpEval_2022, a challenging simplification benchmark consisting of over 1K human ratings of 360 simplifications, including GPT-3.5 generated text. Using SimpEval as training data, we present LENS, a Learnable Evaluation Metric for Text Simplification. Extensive empirical results show that LENS correlates much better with human judgment than existing metrics, paving the way for future progress in the evaluation of text simplification. We also introduce Rank and Rate, a human evaluation framework that rates simplifications from several models in a list-wise manner using an interactive interface; it is designed to ensure both consistency and accuracy in the evaluation process and was used to create the SimpEval datasets.