LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization

While human evaluation remains best practice for accurately judging the faithfulness of automatically generated summaries, few solutions exist to address the increased difficulty and workload when evaluating long-form summaries. Through a survey of 162 papers on long-form summarization, we first shed light on current human evaluation practices surrounding long-form summaries. We find that 73% of these papers do not perform any human evaluation on model-generated summaries, while other works face new difficulties that manifest when dealing with long documents (e.g., low inter-annotator agreement). Motivated by our survey, we present LongEval, a set of guidelines for human evaluation of faithfulness in long-form summaries that addresses the following challenges: (1) How can we achieve high inter-annotator agreement on faithfulness scores? (2) How can we minimize annotator workload while maintaining accurate faithfulness scores? and (3) Do humans benefit from automated alignment between summary and source snippets? We deploy LongEval in annotation studies on two long-form summarization datasets in different domains (SQuALITY and PubMed), and we find that switching to a finer granularity of judgment (e.g., clause-level) reduces inter-annotator variance in faithfulness scores (e.g., std-dev from 18.5 to 6.8). We also show that scores from a partial annotation of fine-grained units correlate highly with scores from a full annotation workload (0.89 Kendall's tau using 50% of the judgments). We release our human judgments, annotation templates, and our software as a Python library for future research.
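To make the partial-annotation claim concrete, the following is a minimal sketch of how one might compare summary-level faithfulness scores from a subset of fine-grained judgments against scores from the full set. It is not the released LongEval library: the function names (`faithfulness_score`, `partial_score`, `kendall_tau`), the binary judgment encoding, the random subsampling, and the synthetic data are all assumptions made for illustration, and the tie-free tau-a variant is used for simplicity.

```python
import random
from itertools import combinations
from statistics import mean


def faithfulness_score(judgments):
    """Percentage of fine-grained units judged faithful (1 = faithful, 0 = not)."""
    return 100.0 * mean(judgments)


def partial_score(judgments, fraction, rng):
    """Score computed from a random subset of the units (assumed sampling scheme)."""
    k = max(1, round(fraction * len(judgments)))
    return faithfulness_score(rng.sample(judgments, k))


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Synthetic data (not from the paper): 20 summaries, 30 units each,
# where each summary has its own hypothetical rate of faithful units.
rng = random.Random(0)
summaries = []
for _ in range(20):
    rate = rng.uniform(0.3, 1.0)
    summaries.append([1 if rng.random() < rate else 0 for _ in range(30)])

full = [faithfulness_score(s) for s in summaries]
half = [partial_score(s, 0.5, rng) for s in summaries]

print("Kendall's tau, 50% vs. full annotation:", round(kendall_tau(full, half), 3))
```

On real annotations, the per-unit judgments would come from annotators rather than a random generator, and a tie-aware variant such as tau-b may be preferable when many summaries share the same score.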