Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators on seven datasets spanning five domains. The documents range from short news articles to long scientific, governmental, and legal texts (2K-27K words), and the datasets include over 1,500 human-annotated summaries. Our results show that traditional lexical overlap metrics (e.g., ROUGE, BLEU) exhibit weak or negative correlation with human judgments, while task-specific neural metrics and LLM-based evaluators achieve substantially higher alignment, especially for linguistic quality assessment. Building on these findings, we propose LLM-ReSum, a self-reflective summarization framework that integrates LLM-based evaluation and generation in a closed feedback loop without model finetuning. Across three domains, LLM-ReSum improves low-quality summaries by up to 33% in factual accuracy and 39% in coverage, and human evaluators prefer the refined summaries in 89% of cases. We additionally introduce PatentSumEval, a new human-annotated benchmark for legal document summarization comprising 180 expert-evaluated summaries. All code and datasets will be released on GitHub.
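The closed feedback loop between evaluation and generation described in the abstract can be illustrated with a minimal sketch. This is not the LLM-ReSum implementation: the function names, the `threshold` and `max_rounds` parameters, and the stopping rule are all assumptions made for illustration, and the toy evaluator and refiner are keyword-based stand-ins for what would be LLM calls in practice.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
#   evaluate(document, summary) -> (score per quality dimension, textual feedback)
#   refine(document, summary, feedback) -> revised summary
Evaluator = Callable[[str, str], Tuple[Dict[str, float], str]]
Refiner = Callable[[str, str, str], str]


def reflect_and_refine(
    document: str,
    summary: str,
    evaluate: Evaluator,
    refine: Refiner,
    threshold: float = 0.8,  # assumed acceptance level for every dimension
    max_rounds: int = 3,     # assumed cap on the number of rewrites
) -> Tuple[str, List[Dict[str, float]]]:
    """Evaluate a summary and rewrite it from the feedback until every
    score reaches `threshold` or `max_rounds` rewrites have been made.
    No model weights are updated; only the summary text changes."""
    history: List[Dict[str, float]] = []
    for _ in range(max_rounds):
        scores, feedback = evaluate(document, summary)
        history.append(scores)
        if min(scores.values()) >= threshold:
            return summary, history
        summary = refine(document, summary, feedback)
    # Score the result of the last rewrite so the history is complete.
    scores, _ = evaluate(document, summary)
    history.append(scores)
    return summary, history


def toy_evaluate(document: str, summary: str) -> Tuple[Dict[str, float], str]:
    """Toy stand-in for an LLM evaluator: coverage is the fraction of long
    document words (more than 6 characters) that appear in the summary."""
    keywords = {w for w in document.lower().split() if len(w) > 6}
    present = set(summary.lower().split())
    missing = sorted(keywords - present)
    coverage = 1.0 if not keywords else 1 - len(missing) / len(keywords)
    return {"coverage": coverage}, " ".join(missing)


def toy_refine(document: str, summary: str, feedback: str) -> str:
    """Toy stand-in for an LLM rewriter: append one missing keyword."""
    missing = feedback.split()
    return f"{summary} {missing[0]}" if missing else summary


if __name__ == "__main__":
    doc = "patent claims describe a rotating turbine assembly"
    final, trace = reflect_and_refine(doc, "a turbine", toy_evaluate, toy_refine)
    print(final)
    print([round(s["coverage"], 2) for s in trace])
```

Passing the evaluator and the refiner as plain callables keeps the loop independent of any particular model or API, in line with the abstract's point that the framework requires no finetuning.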