Traditional evaluation metrics such as ROUGE measure lexical overlap between the reference and generated summaries but do not account for argumentative structure, which is important for legal summaries. In this paper, we propose a novel legal summarization evaluation framework that uses GPT-4 to generate a set of question-answer pairs covering the main points and information in the reference summary. GPT-4 is then used to answer the same questions based on the generated summary. Finally, GPT-4 grades the answers obtained from the reference summary and the generated summary. We examined the correlation between GPT-4 grading and human grading. The results suggest that this question-answering approach with GPT-4 can be a useful tool for gauging the quality of a summary.
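The three stages described above (question-answer generation from the reference summary, answering from the generated summary, and grading) can be sketched roughly as follows. This is a minimal illustration and not the implementation used in the paper: the `llm` callable, the prompt wording, the JSON output format, the 0-10 grading scale, and the averaging of per-question grades are all assumptions introduced here for the example.

```python
import json
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt string to the
# model's text reply (for instance, a thin wrapper around a GPT-4 client).
LLM = Callable[[str], str]


@dataclass
class QAPair:
    question: str
    reference_answer: str


def generate_qa_pairs(llm: LLM, reference_summary: str) -> List[QAPair]:
    """Stage 1: ask the model for QA pairs covering the reference summary."""
    prompt = (
        "Write question-answer pairs that cover the main points of the "
        "summary below. Reply with a JSON list of objects with the keys "
        '"question" and "answer".\n\n' + reference_summary
    )
    # Assumption: the model replies with valid JSON in the requested shape.
    items = json.loads(llm(prompt))
    return [QAPair(item["question"], item["answer"]) for item in items]


def answer_from_summary(llm: LLM, question: str, generated_summary: str) -> str:
    """Stage 2: answer a reference question using only the generated summary."""
    prompt = (
        "Answer the question using only the summary below.\n\n"
        f"Summary: {generated_summary}\n\nQuestion: {question}"
    )
    return llm(prompt).strip()


def grade_answer(llm: LLM, pair: QAPair, candidate_answer: str) -> float:
    """Stage 3: grade the candidate answer against the reference answer."""
    prompt = (
        "Grade the candidate answer against the reference answer on a scale "
        "from 0 to 10. Reply with the number only.\n\n"
        f"Question: {pair.question}\n"
        f"Reference answer: {pair.reference_answer}\n"
        f"Candidate answer: {candidate_answer}"
    )
    # Assumption: the reply is a bare number such as "7" or "7.5".
    return float(llm(prompt).strip())


def evaluate_summary(llm: LLM, reference_summary: str, generated_summary: str) -> float:
    """Run all three stages and return the mean grade (0.0 if no questions)."""
    pairs = generate_qa_pairs(llm, reference_summary)
    if not pairs:
        return 0.0
    grades = [
        grade_answer(llm, pair, answer_from_summary(llm, pair.question, generated_summary))
        for pair in pairs
    ]
    return sum(grades) / len(grades)
```

Passing the model as a plain callable keeps the sketch independent of any particular client library and makes it easy to substitute a stub when testing. The resulting per-summary scores could then be compared with human grades, for example through a correlation coefficient.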