LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation

Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven datasets spanning five domains, covering documents from short news articles to long scientific, governmental, and legal texts (2K-27K words) with over 1,500 human-annotated summaries. Our results show that traditional lexical overlap metrics (e.g., ROUGE, BLEU) exhibit weak or negative correlation with human judgments, while task-specific neural metrics and LLM-based evaluators achieve substantially higher alignment, especially for linguistic quality assessment. Leveraging these findings, we propose LLM-ReSum, a self-reflective summarization framework that integrates LLM-based evaluation and generation in a closed feedback loop without model finetuning. Across three domains, LLM-ReSum improves low-quality summaries by up to 33% in factual accuracy and 39% in coverage, with human evaluators preferring refined summaries in 89% of cases. We additionally introduce PatentSumEval, a new human-annotated benchmark for legal document summarization comprising 180 expert-evaluated summaries. All code and datasets will be released on GitHub.
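
The closed feedback loop described above can be illustrated with a minimal sketch. The abstract does not specify the prompts, scoring dimensions, or stopping rule, so everything here is an assumption for illustration: the names `reflective_summarize`, `evaluate`, `refine`, `threshold`, and `max_rounds` are hypothetical, and the toy evaluator and refiner merely stand in for LLM calls so that the loop is runnable.

```python
from typing import Callable, Dict, Tuple

Scores = Dict[str, float]
# Hypothetical interfaces: in a real system both would wrap LLM calls.
Evaluator = Callable[[str, str], Tuple[Scores, str]]  # (document, summary) -> (scores, feedback)
Refiner = Callable[[str, str, str], str]              # (document, summary, feedback) -> new summary


def reflective_summarize(document: str, draft: str, evaluate: Evaluator,
                         refine: Refiner, threshold: float = 0.8,
                         max_rounds: int = 3) -> Tuple[str, Scores]:
    """Evaluate a summary, refine it using the feedback, and repeat.

    Stops when every score reaches `threshold`, when a refinement fails to
    raise the weakest score, or after `max_rounds` refinements. These
    stopping rules are assumptions, not taken from the paper.
    """
    summary = draft
    scores, feedback = evaluate(document, summary)
    for _ in range(max_rounds):
        if min(scores.values()) >= threshold:
            break
        candidate = refine(document, summary, feedback)
        cand_scores, cand_feedback = evaluate(document, candidate)
        if min(cand_scores.values()) <= min(scores.values()):
            break  # no improvement: keep the previous summary
        summary, scores, feedback = candidate, cand_scores, cand_feedback
    return summary, scores


def toy_evaluate(document: str, summary: str) -> Tuple[Scores, str]:
    """Toy stand-in for an LLM evaluator: coverage = share of document words present."""
    terms = sorted(set(document.lower().split()))
    present = set(summary.lower().split())
    missing = [t for t in terms if t not in present]
    coverage = 1 - len(missing) / len(terms)
    return {"coverage": coverage}, (missing[0] if missing else "")


def toy_refine(document: str, summary: str, feedback: str) -> str:
    """Toy stand-in for an LLM rewriter: append the term named in the feedback."""
    return f"{summary} {feedback}".strip()


if __name__ == "__main__":
    final, final_scores = reflective_summarize(
        "alpha beta gamma delta", "alpha", toy_evaluate, toy_refine)
    print(final, final_scores)
```

No model weights are updated anywhere in the loop, which mirrors the "without model finetuning" property stated in the abstract; only the summary text changes between rounds.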

