Automatic summarization with pre-trained language models has led to impressively fluent results, but is prone to 'hallucinations', low performance on non-news genres, and outputs which are not exactly summaries. Targeting ACL 2023's 'Reality Check' theme, we present GUMSum, a small but carefully crafted dataset of English summaries in 12 written and spoken genres for evaluation of abstractive summarization. Summaries are highly constrained, focusing on substitutive potential, factuality, and faithfulness. We present guidelines and evaluate human agreement as well as subjective judgments on recent system outputs, comparing general-domain untuned approaches, a fine-tuned one, and a prompt-based approach, to human performance. Results show that while GPT3 achieves impressive scores, it still underperforms humans, with varying quality across genres. Human judgments reveal different types of errors in supervised, prompted, and human-generated summaries, shedding light on the challenges of producing a good summary.