State-of-the-art summarization systems can generate highly fluent summaries. These summaries, however, may contain factual inconsistencies and/or information not present in the source. Hence, an important component of assessing the quality of summaries is to determine whether there is information consistency between the source and the summary. Existing approaches are typically based on lexical matching or representation-based methods. In this work, we introduce an alternative scheme based on standard information-theoretic measures in which the information present in the source and summary is directly compared. We propose a Multiple-choice Question Answering and Generation framework, MQAG, which approximates the information consistency by computing the expected KL-divergence between summary and source answer distributions over automatically generated multiple-choice questions. This approach exploits multiple-choice answer probabilities, as predicted answer distributions can be easily compared. We conduct experiments on four summary evaluation datasets: QAG-CNNDM/XSum, XSum-Faithfulness, Podcast Assessment, and SummEval. Experiments show that MQAG (using models trained on RACE) outperforms existing evaluation methods on the majority of tasks.
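The scoring idea described above can be illustrated with a minimal sketch. It assumes the answer distributions over the multiple-choice options have already been produced by some answering model, once conditioned on the source and once on the summary; the function names `kl_divergence` and `expected_kl`, the toy probabilities, and the direction of the divergence (summary relative to source) are illustrative assumptions, not details taken from the abstract.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same answer options."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same number of options")
    # Terms with p_i == 0 contribute nothing; eps guards against log(0) in q.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def expected_kl(
    summary_dists: Sequence[Sequence[float]],
    source_dists: Sequence[Sequence[float]],
) -> float:
    """Average KL divergence over questions (hypothetical helper).

    summary_dists[i] and source_dists[i] are the option probabilities for
    question i when the answering model reads the summary or the source.
    The direction KL(summary || source) is an assumption of this sketch.
    """
    if len(summary_dists) != len(source_dists) or not summary_dists:
        raise ValueError("need one distribution pair per question")
    total = sum(kl_divergence(s, x) for s, x in zip(summary_dists, source_dists))
    return total / len(summary_dists)


# Toy example: two generated questions with four options each (made-up numbers).
summary = [[0.70, 0.10, 0.10, 0.10], [0.25, 0.25, 0.25, 0.25]]
source = [[0.65, 0.15, 0.10, 0.10], [0.05, 0.85, 0.05, 0.05]]
score = expected_kl(summary, source)
print(f"expected KL over questions: {score:.4f}")
```

A lower score means the two answer distributions agree more closely, which under this scheme indicates higher information consistency. The second toy question, where the summary leaves the answer undetermined but the source strongly favors one option, contributes most of the divergence.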