Detecting factual errors in summaries is an important and challenging problem in summarization research. Inspired by the emergent abilities of large language models (LLMs), we explore evaluating the factual consistency of summaries by directly prompting LLMs. We present a comprehensive empirical study of LLMs as factual consistency evaluators, which consists of (1) analyzing different LLMs, such as the GPT model series and Flan-T5; (2) investigating a variety of prompting methods, including vanilla prompting, chain-of-thought prompting, and a sentence-by-sentence prompting method for long summaries; and (3) evaluating on diverse summaries generated by multiple summarization systems, ranging from pre-transformer methods to state-of-the-art (SOTA) pretrained models. Our experiments demonstrate that prompting LLMs outperforms the previous best factuality systems in all settings, by up to 12.2 absolute points in binary classification accuracy on inconsistency detection.
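To make the sentence-by-sentence prompting idea concrete, the following is a minimal sketch, not the implementation used in the study. The prompt wording, the naive sentence splitter, the `ask_llm` callable, and the rule that a summary counts as consistent only if every sentence is judged consistent are all assumptions made for illustration. The `accuracy` helper shows the binary classification accuracy used as the reported metric.

```python
import re
from typing import Callable, List

# Hypothetical prompt wording; the exact prompts used in the study are not given here.
PROMPT_TEMPLATE = (
    "Decide if the following sentence is consistent with the article.\n"
    "Article: {article}\n"
    "Sentence: {sentence}\n"
    "Answer (yes or no):"
)


def split_sentences(summary: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", summary.strip())
    return [p for p in parts if p]


def is_summary_consistent(
    article: str, summary: str, ask_llm: Callable[[str], str]
) -> bool:
    """Query the model once per summary sentence.

    Assumed aggregation rule: the summary is consistent only if
    every sentence is judged consistent with the article.
    `ask_llm` is a hypothetical callable that maps a prompt to the model's reply.
    """
    for sentence in split_sentences(summary):
        prompt = PROMPT_TEMPLATE.format(article=article, sentence=sentence)
        answer = ask_llm(prompt).strip().lower()
        if not answer.startswith("yes"):
            return False
    return True


def accuracy(predictions: List[bool], labels: List[bool]) -> float:
    """Binary classification accuracy: fraction of predictions matching labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)
```

In practice `ask_llm` would wrap a call to whichever model is being evaluated; keeping it as a plain callable lets the same loop be reused across models and prompt variants.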