Training learnable metrics using modern language models has recently emerged as a promising method for the automatic evaluation of machine translation. However, existing human evaluation datasets for text simplification have limited annotations that are based on unitary or outdated models, making them unsuitable for this approach. To address these issues, we introduce the SimpEval corpus that contains: SimpEval_past, comprising 12K human ratings on 2.4K simplifications of 24 past systems, and SimpEval_2022, a challenging simplification benchmark consisting of over 1K human ratings of 360 simplifications including GPT-3.5 generated text. Training on SimpEval, we present LENS, a Learnable Evaluation Metric for Text Simplification. Extensive empirical results show that LENS correlates much better with human judgment than existing metrics, paving the way for future progress in the evaluation of text simplification. We also introduce Rank and Rate, a human evaluation framework that rates simplifications from several models in a list-wise manner using an interactive interface, which ensures both consistency and accuracy in the evaluation process and is used to create the SimpEval datasets.
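The claim that a metric "correlates with human judgment" is usually checked by scoring the same simplifications with both the metric and human raters, then computing a correlation coefficient over the paired scores. The sketch below is illustrative only: it does not reproduce the paper's evaluation protocol, the function names `pearson` and `kendall_tau_a` are our own, and the toy scores are hypothetical values, not data from SimpEval.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # Tied pairs (sign == 0) count toward neither total.
    return (concordant - discordant) / (n * (n - 1) // 2)


if __name__ == "__main__":
    # Hypothetical toy scores for five simplifications (NOT from SimpEval).
    human_ratings = [80, 45, 60, 90, 30]
    metric_scores = [0.71, 0.40, 0.65, 0.88, 0.35]
    print("Pearson:", pearson(human_ratings, metric_scores))
    print("Kendall tau-a:", kendall_tau_a(human_ratings, metric_scores))
```

Pearson measures linear agreement between the raw scores, while Kendall's tau depends only on whether the metric orders pairs of simplifications the same way the human raters do, which makes it less sensitive to the scale of either score.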