How to properly conduct human evaluations for text summarization is a longstanding challenge. The Pyramid human evaluation protocol, which assesses content selection by breaking the reference summary into sub-units and verifying their presence in the system summary, has been widely adopted. However, it suffers from a lack of systematicity in the definition and granularity of the sub-units. We address these problems by proposing QAPyramid, which decomposes each reference summary into finer-grained question-answer (QA) pairs according to the QA-SRL framework. We collect QA-SRL annotations for reference summaries from CNN/DM and evaluate 10 summarization systems, resulting in 8.9K QA-level annotations. We show that, compared to Pyramid, QAPyramid provides more systematic and fine-grained content selection evaluation while maintaining high inter-annotator agreement without needing expert annotations. Furthermore, we propose metrics that automate the evaluation pipeline and achieve higher correlations with QAPyramid than other widely adopted metrics, allowing future work to accurately and efficiently benchmark summarization systems.
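To make the scoring idea concrete, the following is a minimal sketch of how a QAPyramid-style content selection score could be computed once a reference summary has been decomposed into QA pairs and each pair has received a presence judgment against the system summary. The uniform averaging of binary judgments is an assumption for illustration; the names `QAPair` and `qapyramid_score` are hypothetical and not taken from the paper's released code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QAPair:
    """One QA-SRL question-answer pair decomposed from the reference summary."""
    question: str
    answer: str
    present: bool  # judgment: is this content conveyed by the system summary?


def qapyramid_score(qa_pairs: List[QAPair]) -> float:
    """Fraction of reference QA pairs covered by the system summary.

    Assumes binary presence judgments and uniform weighting of QA pairs,
    which is a simplification of the full protocol.
    """
    if not qa_pairs:
        return 0.0
    return sum(p.present for p in qa_pairs) / len(qa_pairs)


# Hypothetical example: two of three QA pairs judged present -> score ~0.67
pairs = [
    QAPair("Who announced the merger?", "the company", True),
    QAPair("What was announced?", "the merger", True),
    QAPair("When was it announced?", "on Friday", False),
]
print(qapyramid_score(pairs))
```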