The proliferation of automatic faithfulness metrics for summarization has created a need for benchmarks to evaluate them. Existing benchmarks measure how well metrics correlate with human judgements of faithfulness on model-generated summaries, but they are insufficient for diagnosing whether metrics are: 1) consistent, i.e., indicate lower faithfulness as errors are introduced into a summary, 2) effective on human-written texts, and 3) sensitive to different error types (as summaries can contain multiple errors). To address these needs, we present a benchmark of unfaithful minimal pairs (BUMP), a dataset of 889 human-written, minimally different summary pairs, in which a single error is introduced into a summary from the CNN/DailyMail dataset to produce an unfaithful summary. We find that BUMP complements existing benchmarks in three ways: 1) the summaries in BUMP are harder to discriminate and less probable under SOTA summarization models, 2) unlike non-pair-based datasets, BUMP can be used to measure the consistency of metrics, and it reveals that the most discriminative metrics tend not to be the most consistent, and 3) unlike datasets containing generated summaries with multiple errors, BUMP enables the measurement of metrics' performance on individual error types.
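To make the distinction between consistency and discriminative power concrete, the following is a minimal sketch of how the two quantities could be computed on minimal pairs. The abstract does not give the exact definitions, so the choices here are assumptions: consistency is taken as the fraction of pairs in which the metric scores the faithful summary strictly higher than its unfaithful counterpart, and discrimination is taken as ROC AUC over all summaries pooled together. The `token_overlap` metric and the `toy_pairs` data are hypothetical stand-ins, not part of BUMP.

```python
from typing import Callable, List, Tuple

# (document, faithful_summary, unfaithful_summary); illustrative layout only.
Pair = Tuple[str, str, str]
# Hypothetical metric signature: higher score means "more faithful".
Metric = Callable[[str, str], float]


def consistency(metric: Metric, pairs: List[Pair]) -> float:
    """Fraction of pairs where the faithful summary gets the strictly higher score."""
    hits = sum(1 for doc, good, bad in pairs if metric(doc, good) > metric(doc, bad))
    return hits / len(pairs)


def roc_auc(metric: Metric, pairs: List[Pair]) -> float:
    """ROC AUC with pairing ignored: the probability that a random faithful
    summary outscores a random unfaithful one (ties count as 0.5)."""
    pos = [metric(doc, good) for doc, good, _ in pairs]
    neg = [metric(doc, bad) for doc, _, bad in pairs]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def token_overlap(document: str, summary: str) -> float:
    """Toy stand-in metric: share of summary tokens that also appear in the document."""
    doc_tokens = set(document.lower().split())
    summ_tokens = summary.lower().split()
    if not summ_tokens:
        return 0.0
    return sum(t in doc_tokens for t in summ_tokens) / len(summ_tokens)


# Made-up minimal pairs: each unfaithful summary differs by a single word.
toy_pairs: List[Pair] = [
    ("the cat sat on the mat", "the cat sat", "the dog sat"),
    ("alice met bob in paris on monday",
     "alice met bob in paris", "alice met bob in rome"),
    ("the company reported record profits this quarter",
     "firm reported record earnings", "firm reported small earnings"),
]

if __name__ == "__main__":
    print("consistency:", consistency(token_overlap, toy_pairs))
    print("roc_auc:", roc_auc(token_overlap, toy_pairs))
```

On this toy data the metric ranks every pair correctly, yet its pooled AUC is below 1 because the faithful summary in the third pair scores lower than the unfaithful summaries in the other two. The example only shows that the two quantities can diverge; it says nothing about how real metrics behave on BUMP.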