STORYSUMM: Evaluating Faithfulness in Story Summarization

Human evaluation has been the gold standard for checking faithfulness in abstractive summarization. However, with a challenging source domain like narrative, multiple annotators can agree a summary is faithful, while missing details that are obvious errors only once pointed out. We therefore introduce a new dataset, STORYSUMM, comprising LLM summaries of short stories with localized faithfulness labels and error explanations. This benchmark is for evaluation methods, testing whether a given method can detect challenging inconsistencies. Using this dataset, we first show that any one human annotation protocol is likely to miss inconsistencies, and we advocate for pursuing a range of methods when establishing ground truth for a summarization dataset. We finally test recent automatic metrics and find that none of them achieve more than 70% balanced accuracy on this task, demonstrating that it is a challenging benchmark for future work in faithfulness evaluation.
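The 70% figure above is a balanced accuracy, which is commonly defined as the mean of the per-class recalls, so a metric cannot score well simply by labeling every summary faithful. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name, the 1 = faithful / 0 = unfaithful encoding, and the toy labels are all assumptions made for illustration.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the recall on each class (0 = unfaithful, 1 = faithful).

    Assumed encoding for illustration only; the dataset's own label
    format may differ.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    recalls = []
    for label in (0, 1):
        # Indices of the examples whose gold label is this class.
        idx = [i for i, t in enumerate(y_true) if t == label]
        if not idx:
            raise ValueError(f"no gold examples with label {label}")
        correct = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical toy data: 8 faithful summaries and 2 unfaithful ones.
gold = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

# A detector that calls everything faithful has 80% plain accuracy
# but only 50% balanced accuracy.
always_faithful = [1] * 10
print(balanced_accuracy(gold, always_faithful))  # 0.5

# A detector that catches one of the two unfaithful summaries
# but also flags two faithful ones.
mixed = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1]
print(balanced_accuracy(gold, mixed))  # 0.625
```

Under this definition, a trivial detector that always predicts the majority class sits at 50%, which gives a baseline for reading the reported ceiling of 70%.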

