Traditional evaluation metrics such as ROUGE measure lexical overlap between the reference and generated summaries but do not take argumentative structure into account, which is important for legal summaries. In this paper, we propose a novel legal summarization evaluation framework that uses GPT-4 to generate a set of question-answer pairs covering the main points and information in the reference summary. GPT-4 then answers the same questions based on the generated summary. Finally, GPT-4 grades the answers from the reference summary and from the generated summary. We examined the correlation between GPT-4 grading and human grading. The results suggest that this question-answering approach with GPT-4 can be a useful tool for gauging the quality of a summary.
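The three steps described in the abstract (question-answer generation from the reference summary, answering from the generated summary, and grading) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the prompts, the `Q: ... | A: ...` line format, the 0 to 10 grading scale, the choice to grade each candidate answer against the reference answer, the averaging of grades, and all function names are hypothetical. The `llm` parameter stands for any function that maps a prompt string to a completion string, for example a thin wrapper around a GPT-4 client.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption: any callable that maps a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class QAPair:
    question: str
    answer: str


def generate_qa_pairs(reference_summary: str, llm: LLM) -> List[QAPair]:
    """Step 1: ask the model for question-answer pairs covering the reference."""
    prompt = (
        "Write question-answer pairs covering the main points of this summary.\n"
        "Use one line per pair, formatted as 'Q: <question> | A: <answer>'.\n\n"
        + reference_summary
    )
    pairs = []
    for line in llm(prompt).splitlines():
        # Keep only lines that follow the (hypothetical) requested format.
        if line.startswith("Q:") and "| A:" in line:
            question, answer = line[2:].split("| A:", 1)
            pairs.append(QAPair(question.strip(), answer.strip()))
    return pairs


def answer_from_summary(question: str, generated_summary: str, llm: LLM) -> str:
    """Step 2: answer a reference-derived question from the generated summary."""
    prompt = (
        "Answer the question using only the summary below.\n"
        f"Summary: {generated_summary}\n"
        f"Question: {question}"
    )
    return llm(prompt).strip()


def grade_answer(question: str, reference_answer: str,
                 candidate_answer: str, llm: LLM) -> float:
    """Step 3: grade the candidate answer; the 0-10 scale is an assumption."""
    prompt = (
        "Grade the candidate answer against the reference answer.\n"
        "Reply with a single number from 0 to 10.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Candidate answer: {candidate_answer}"
    )
    return float(llm(prompt).strip())


def evaluate_summary(reference_summary: str, generated_summary: str,
                     llm: LLM) -> float:
    """Run all three steps and return the mean grade over all questions."""
    pairs = generate_qa_pairs(reference_summary, llm)
    if not pairs:
        raise ValueError("the model returned no parsable question-answer pairs")
    grades = [
        grade_answer(
            pair.question,
            pair.answer,
            answer_from_summary(pair.question, generated_summary, llm),
            llm,
        )
        for pair in pairs
    ]
    return sum(grades) / len(grades)
```

Passing the model as a plain callable keeps the sketch independent of any particular client library and lets the pipeline be exercised with a stub before any real model calls are made.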