Sentence simplification, which rewrites a sentence to make it easier to read and understand, is a promising technique for helping people with various reading difficulties. With the rise of advanced large language models (LLMs), evaluating their performance on sentence simplification has become imperative. Recent studies have used both automatic metrics and human evaluations to assess the simplification abilities of LLMs. However, it remains unclear whether existing evaluation methodologies are suitable for LLMs. First, it is uncertain whether current automatic metrics can reliably evaluate LLM-generated simplifications. Second, current human evaluation approaches in sentence simplification often fall into one of two extremes: they are either too superficial, failing to offer a clear understanding of the models' performance, or overly detailed, making the annotation process complex and prone to inconsistency, which in turn undermines the evaluation's reliability. To address these problems, this study provides in-depth insights into LLMs' performance while ensuring the reliability of the evaluation. We design an error-based human annotation framework to assess GPT-4's simplification capabilities. Results show that GPT-4 generally generates fewer erroneous simplification outputs than the current state of the art. However, LLMs have their limitations, as seen in GPT-4's struggles with lexical paraphrasing. Furthermore, we conduct meta-evaluations of widely used automatic metrics using our human annotations. We find that while these metrics are effective at distinguishing significant quality differences, they lack sufficient sensitivity to assess the overall high-quality simplifications generated by GPT-4.