We present BLESS, a comprehensive performance benchmark of the most recent state-of-the-art large language models (LLMs) on the task of text simplification (TS). We examine how well off-the-shelf LLMs can solve this challenging task, assessing a total of 44 models, differing in size, architecture, pre-training methods, and accessibility, on three test sets from different domains (Wikipedia, news, and medical) under a few-shot setting. Our analysis considers a suite of automatic metrics as well as a large-scale quantitative investigation into the types of common edit operations performed by the different models. Furthermore, we perform a manual qualitative analysis on a subset of model outputs to better gauge the quality of the generated simplifications. Our evaluation indicates that the best LLMs, despite not being trained on TS, perform comparably with state-of-the-art TS baselines. Additionally, we find that certain LLMs demonstrate a greater range and diversity of edit operations. Our performance benchmark will be available as a resource for the development of future TS methods and evaluation metrics.