An In-depth Evaluation of GPT-4 in Sentence Simplification with Error-based Human Assessment

Sentence simplification, which rewrites a sentence to be easier to read and understand, is a promising technique to help people with various reading difficulties. With the rise of advanced large language models (LLMs), evaluating their performance in sentence simplification has become imperative. Recent studies have used both automatic metrics and human evaluations to assess the simplification abilities of LLMs. However, the suitability of existing evaluation methodologies for LLMs remains in question. First, it is still uncertain whether current automatic metrics are suitable for evaluating LLM-generated simplifications. Second, current human evaluation approaches in sentence simplification often fall into two extremes: they are either too superficial, failing to offer a clear understanding of the models' performance, or overly detailed, making the annotation process complex and prone to inconsistency, which in turn affects the evaluation's reliability. To address these problems, this study provides in-depth insights into LLMs' performance while ensuring the reliability of the evaluation. We design an error-based human annotation framework to assess GPT-4's simplification capabilities. Results show that GPT-4 generally generates fewer erroneous simplification outputs than the current state-of-the-art. However, LLMs have their limitations, as seen in GPT-4's struggles with lexical paraphrasing. Furthermore, we conduct meta-evaluations of widely used automatic metrics using our human annotations. We find that while these metrics are effective at detecting significant quality differences, they lack sufficient sensitivity to assess the overall high-quality simplifications produced by GPT-4.


