Lack of factual correctness still plagues state-of-the-art summarization systems despite their impressive progress in generating seemingly fluent summaries. In this paper, we show that factual inconsistency can be caused by irrelevant parts of the input text, which act as confounders. To that end, we leverage information-theoretic measures of causal effects to quantify the amount of confounding and to characterize precisely how it affects summarization performance. Based on insights derived from our theoretical results, we design a simple multi-task model that controls such confounding by leveraging human-annotated relevant sentences when available. Crucially, we give a principled characterization of data distributions where such confounding can be large, thereby necessitating the use of human-annotated relevant sentences to generate factual summaries. Our approach improves faithfulness scores by 20\% over strong baselines on AnswerSumm \citep{fabbri2021answersumm}, a conversation summarization dataset where lack of faithfulness is a significant issue due to the subjective nature of the task. Our best method achieves the highest faithfulness score while also achieving state-of-the-art results on standard metrics such as ROUGE and METEOR. We corroborate these improvements through human evaluation.