Despite recent advances in abstractive text summarization, current models still generate factually inconsistent summaries, which reduces their utility in real-world applications. We argue that the main reason is that summarization models trained with a maximum likelihood objective assign high probability to sequences that are plausible given the context, but often fail to rank sequences accurately by their consistency. In this work, we address this problem by calibrating the likelihood of model-generated sequences to better align with a consistency metric measured by natural language inference (NLI) models. A human evaluation study and automatic metrics show that the calibrated models generate more consistent and higher-quality summaries. We also show that models trained with our method return probabilities that are better aligned with the NLI scores, which significantly increases the reliability of summarization models.
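To make the calibration idea concrete, the sketch below shows one common way to align sequence likelihoods with an external score: a pairwise margin ranking loss over candidate summaries. It is an illustration under stated assumptions, not necessarily the objective used in this work. The function name `rank_calibration_loss`, the `margin` value, and the inputs are hypothetical; the log-likelihoods and NLI consistency scores are assumed to have been computed elsewhere for a set of candidates generated from the same source document.

```python
from typing import Sequence


def rank_calibration_loss(
    log_likelihoods: Sequence[float],
    consistency_scores: Sequence[float],
    margin: float = 0.1,  # hypothetical hyperparameter, chosen for illustration
) -> float:
    """Pairwise hinge loss over candidate summaries (illustrative sketch).

    For every pair (i, j) where candidate i has a higher consistency score
    than candidate j, the loss is zero only if candidate i's log-likelihood
    exceeds candidate j's by at least `margin`.
    """
    if len(log_likelihoods) != len(consistency_scores):
        raise ValueError("expected one consistency score per candidate")

    total = 0.0
    n_pairs = 0
    n = len(log_likelihoods)
    for i in range(n):
        for j in range(n):
            if consistency_scores[i] > consistency_scores[j]:
                gap = log_likelihoods[i] - log_likelihoods[j]
                total += max(0.0, margin - gap)
                n_pairs += 1

    # Average over ordered pairs; no pairs (e.g. all scores tied) gives zero loss.
    return total / n_pairs if n_pairs else 0.0


# Candidate 0 is more consistent and already more likely: no penalty.
well_ordered = rank_calibration_loss([-1.0, -2.0], [0.9, 0.1])

# Candidate 0 is more consistent but less likely: the pair is penalised.
mis_ordered = rank_calibration_loss([-2.0, -1.0], [0.9, 0.1])
```

In a training setup, the log-likelihoods would be differentiable model outputs, so minimising a loss of this kind pushes the model to assign higher probability to the candidates that the NLI model judges more consistent.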