Maintaining factual consistency is a critical issue in abstractive text summarisation; however, it cannot be assessed by the traditional automatic metrics used to evaluate text summarisation, such as ROUGE. Recent efforts have been devoted to developing improved metrics for measuring factual consistency using pre-trained language models, but these metrics have restrictive token limits and are therefore not suitable for evaluating long document text summarisation. Moreover, there is limited research evaluating whether existing automatic evaluation metrics are fit for purpose when applied to long document data sets. In this work, we evaluate the efficacy of automatic metrics at assessing factual consistency in long document text summarisation and propose a new evaluation framework, LongDocFACTScore. This framework allows metrics to be extended to documents of any length, and it outperforms existing state-of-the-art metrics in its ability to correlate with human measures of factuality when used to evaluate long document summarisation data sets. Furthermore, we show that LongDocFACTScore has performance comparable to state-of-the-art metrics when evaluated against human measures of factual consistency on short document data sets. We make our code and annotated data publicly available: https://github.com/jbshp/LongDocFACTScore.