Neural models for abstractive summarization tend to generate output that is fluent and well-formed but lacks semantic faithfulness, or factuality, with respect to the input documents. In this paper, we analyze the tradeoff between abstractiveness and factuality of generated summaries across multiple datasets and models, using extensive human evaluations of factuality. In our analysis, we visualize the rates of change in factuality as we gradually increase abstractiveness using a decoding constraint. We observe that, while increased abstractiveness generally leads to a drop in factuality, the rate of factuality decay depends on factors such as the data that the system was trained on. We introduce two datasets with human factuality judgements: one containing 10.2k generated summaries with systematically varied degrees of abstractiveness, and the other containing 4.2k summaries from five different summarization models. We propose new factuality metrics that adjust for the degree of abstractiveness, and we use them to compare the abstractiveness-adjusted factuality of prior summarization work, providing baselines for future work.