The split and rephrase (SR) task aims to divide a long, complex sentence into a set of shorter, simpler sentences that convey the same meaning. This challenging NLP problem has recently gained attention because of its benefits as a pre-processing step for other NLP tasks. Evaluating the quality of SR output is difficult, as there is no automatic metric suited to this task. In this work, we introduce CEScore, a novel statistical model for automatically evaluating the SR task. By mimicking the way humans evaluate SR, CEScore provides four metrics (Sscore, Gscore, Mscore, and CEscore) to assess simplicity, grammaticality, meaning preservation, and overall quality, respectively. In experiments with 26 models, CEScore correlates strongly with human evaluations, achieving a Spearman correlation of 0.98 at the model level. This underscores the potential of CEScore as a simple and effective metric for assessing the overall quality of SR models.
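To make the reported model-level Spearman correlation concrete, the following is a minimal sketch of how such a figure could be computed. It is not CEScore itself and not the authors' evaluation code: it assumes that each model's per-sentence scores are first averaged into one value per model, and that the resulting metric means are then rank-correlated with the human means. The function names (`average_ranks`, `spearman`, `model_level_spearman`), the model names, and all numbers are hypothetical and chosen only for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values equal to the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def model_level_spearman(metric_scores, human_scores):
    """Correlate per-model mean metric scores with per-model mean human scores.

    Both arguments map a model name to a list of per-sentence scores
    (an assumed data layout, not one taken from the paper).
    """
    models = sorted(metric_scores)
    metric_means = [mean(metric_scores[m]) for m in models]
    human_means = [mean(human_scores[m]) for m in models]
    return spearman(metric_means, human_means)


# Invented toy data for three hypothetical models.
toy_metric = {"model_a": [0.2, 0.4], "model_b": [0.6, 0.8], "model_c": [0.5, 0.5]}
toy_human = {"model_a": [1, 2], "model_b": [4, 5], "model_c": [3, 3]}
rho = model_level_spearman(toy_metric, toy_human)
print(f"model-level Spearman correlation: {rho:.2f}")
```

Because the toy metric means and human means rank the three models identically, the sketch yields a correlation of 1.0. Real evaluations would use many more models (26 in the experiments described above), and a library routine such as `scipy.stats.spearmanr` could replace the hand-written rank correlation.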