Automated evaluation is crucial for streamlining text summarization benchmarking and model development, given the costly and time-consuming nature of human evaluation. Traditional methods like ROUGE correlate poorly with human judgment, while recently proposed LLM-based metrics provide only summary-level assessment using Likert-scale scores. This limits deeper model analysis: for example, a summary-level metric can assign only a single hallucination score per summary, whereas sentence-level evaluation lets us count exactly which sentences contain hallucinations. To remedy these limitations, we propose FineSurE, a fine-grained evaluator specifically tailored to the summarization task using large language models (LLMs). In addition to faithfulness, it employs completeness and conciseness criteria, enabling multi-dimensional assessment. We compare various open-source and proprietary LLMs as backbones for FineSurE. We also conduct extensive benchmarking of FineSurE against SOTA methods, including NLI-, QA-, and LLM-based methods, showing improved performance, especially on the completeness and conciseness dimensions. The code is available at https://github.com/DISL-Lab/FineSurE-ACL24.
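The contrast between summary-level and sentence-level scoring can be sketched as follows. This is a minimal illustration, not the FineSurE implementation: it assumes per-sentence binary hallucination labels (e.g., produced by an LLM judge) and aggregates them into a faithfulness ratio, retaining which sentences erred.

```python
def faithfulness_score(sentence_is_faithful: list[bool]) -> float:
    """Fraction of summary sentences judged faithful (hallucination-free).

    A summary-level Likert metric collapses quality into one number;
    keeping per-sentence labels also tells us *which* sentences err.
    """
    if not sentence_is_faithful:
        return 0.0
    return sum(sentence_is_faithful) / len(sentence_is_faithful)


# Hypothetical judgments for a 4-sentence summary: sentence 3 hallucinates.
labels = [True, True, False, True]
print(faithfulness_score(labels))          # → 0.75
print([i for i, ok in enumerate(labels) if not ok])  # → [2]
```

Fine-grained labels like these subsume the coarse score: the ratio recovers a summary-level number, while the label list supports error analysis that a single Likert score cannot.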