State-of-the-art summarization systems can generate highly fluent summaries. These summaries, however, may contain factual inconsistencies and/or information not present in the source. Hence, an important component of assessing the quality of summaries is to determine whether there is information consistency between the source and the summary. Existing approaches are typically based on lexical matching or representation-based methods. In this work, we introduce an alternative scheme based on standard information-theoretic measures in which the information present in the source and summary is directly compared. We propose a Multiple-choice Question Answering and Generation framework, MQAG, which approximates the information consistency by computing the expected statistical distance between summary and source answer distributions over automatically generated multiple-choice questions. This approach exploits multiple-choice answer probabilities, as predicted answer distributions can be compared. We conduct experiments on four summary evaluation datasets: QAG-CNNDM/XSum, XSum-Hallucination, Podcast Assessment, and SummEval. Experiments show that MQAG, using models trained on SQuAD or RACE, outperforms existing evaluation methods on the majority of tasks.
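As a minimal sketch of the scoring idea described above, the following Python averages a statistical distance between summary-conditioned and source-conditioned answer distributions over a set of multiple-choice questions. The abstract does not say which distance is used, so total variation and KL divergence appear here only as illustrative choices. The function names and the hard-coded probabilities are hypothetical; in practice the distributions would come from question-generation and question-answering models.

```python
import math
from typing import Callable, Sequence

Distribution = Sequence[float]


def total_variation(p: Distribution, q: Distribution) -> float:
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def kl_divergence(p: Distribution, q: Distribution, eps: float = 1e-12) -> float:
    """KL(p || q), with a small epsilon to avoid division by zero."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def expected_distance(
    summary_dists: Sequence[Distribution],
    source_dists: Sequence[Distribution],
    distance: Callable[[Distribution, Distribution], float] = total_variation,
) -> float:
    """Average the distance over questions (one distribution pair per question).

    Lower values suggest the summary and the source support the same answers.
    """
    if not summary_dists or len(summary_dists) != len(source_dists):
        raise ValueError("need one summary/source distribution pair per question")
    total = sum(distance(ps, px) for ps, px in zip(summary_dists, source_dists))
    return total / len(summary_dists)


# Hypothetical answer probabilities over four options for two questions.
summary_dists = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
source_dists = [[0.7, 0.1, 0.1, 0.1], [0.9, 0.05, 0.03, 0.02]]

score = expected_distance(summary_dists, source_dists)
print(round(score, 3))  # roughly 0.325 with total variation
```

The first question yields identical distributions and contributes zero distance. The second shows the summary-conditioned answerer at chance while the source-conditioned answerer is confident, which drives the score up. Passing `distance=kl_divergence` swaps in an asymmetric measure without changing the averaging step.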