We study existing approaches that leverage off-the-shelf Natural Language Inference (NLI) models to evaluate summary faithfulness and argue that they are sub-optimal because of the granularity at which premises and hypotheses are considered. That is, the smallest content unit considered as a hypothesis is a sentence, and premises are made up of a fixed number of document sentences. We propose a novel approach, namely InFusE, that uses a variable premise size and simplifies summary sentences into shorter hypotheses. Departing from previous studies, which focus on single short-document summarisation, we analyse NLI-based faithfulness evaluation for diverse summarisation tasks. We introduce DiverSumm, a new benchmark comprising long-form summarisation (long documents and summaries) and diverse summarisation tasks (e.g., meeting and multi-document summarisation). In experiments, InFusE obtains superior performance across the different summarisation tasks. Our code and data are available at https://github.com/HJZnlp/infuse.
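To illustrate the contrast between a fixed and a variable premise size, the following is a minimal sketch and not the released InFusE implementation. Everything in it is an assumption made for illustration: the function names are hypothetical, `toy_entailment_score` is a token-overlap stand-in for a real NLI model, and the stopping rule (stop growing the premise once the score no longer improves) is a guess, since the abstract does not specify one. The hypothesis is assumed to be an already simplified summary sentence.

```python
from typing import Callable, List, Tuple


def toy_entailment_score(premise: str, hypothesis: str) -> float:
    """Stand-in for an NLI model (assumption): the fraction of hypothesis
    tokens that also occur in the premise."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return 0.0
    hits = sum(token in premise_tokens for token in hypothesis_tokens)
    return hits / len(hypothesis_tokens)


def variable_premise_score(
    doc_sentences: List[str],
    hypothesis: str,
    scorer: Callable[[str, str], float] = toy_entailment_score,
) -> Tuple[float, List[str]]:
    """Grow the premise one document sentence at a time, stopping as soon
    as adding a sentence no longer improves the score (assumed rule)."""
    # Rank document sentences by how well each one alone supports the hypothesis.
    ranked = sorted(doc_sentences, key=lambda s: scorer(s, hypothesis), reverse=True)
    premise: List[str] = []
    best = 0.0
    for sentence in ranked:
        candidate = premise + [sentence]
        score = scorer(" ".join(candidate), hypothesis)
        # Always accept the first sentence; afterwards require a strict gain.
        if premise and score <= best:
            break
        premise, best = candidate, score
    return best, premise


doc = [
    "the meeting started at noon",
    "the budget was approved",
    "lunch was served",
]
hypothesis = "the meeting approved the budget"
score, premise = variable_premise_score(doc, hypothesis)
print(score, premise)
```

In this toy example no single document sentence covers the hypothesis, so a premise fixed at one sentence would under-score it, whereas the variable-size premise stops after two sentences. In practice the scorer would be replaced by the entailment probability of an off-the-shelf NLI model.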