Factual inconsistency between automatically generated summaries and their source documents can spread misinformation and pose risks. Existing factual consistency (FC) metrics are limited in their performance, efficiency, and explainability. Recent advances in large language models (LLMs) have demonstrated remarkable potential in text evaluation, but their effectiveness in assessing FC in summarisation remains underexplored. Prior research has mostly focused on proprietary LLMs, leaving essential factors that affect their assessment capabilities unexamined. Additionally, current FC evaluation benchmarks are restricted to news articles, casting doubt on the generality of the FC methods tested on them. In this paper, we first address this gap by introducing TreatFact, a dataset of LLM-generated summaries of clinical texts, annotated for FC by domain experts. Moreover, we benchmark 11 LLMs for FC evaluation across the news and clinical domains and analyse the impact of model size, prompts, pre-training and fine-tuning data. Our findings reveal that while proprietary models prevail on the task, open-source LLMs lag behind. Nevertheless, there is potential for enhancing the performance of open-source LLMs through increasing model size, expanding pre-training data, and developing well-curated fine-tuning data. Experiments on TreatFact suggest that both previous methods and LLM-based evaluators are unable to capture factual inconsistencies in clinical summaries, posing a new challenge for FC evaluation.
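To make the LLM-based evaluation setup concrete, the following is a minimal sketch of prompt-based FC judging. The prompt wording and the `call_llm` helper are illustrative assumptions for exposition, not the paper's exact protocol or prompts.

```python
# Minimal sketch of prompt-based factual-consistency (FC) evaluation.
# The prompt template and call_llm are hypothetical placeholders; swap
# in a real client for any proprietary or open-source LLM endpoint.

FC_PROMPT = """Decide whether the summary is factually consistent \
with the document.

Document:
{document}

Summary:
{summary}

Answer with exactly one word: "Yes" or "No"."""


def call_llm(prompt: str) -> str:
    """Placeholder for a chat/completions API call to the judge LLM."""
    raise NotImplementedError("plug in an LLM client here")


def judge_factual_consistency(document: str, summary: str) -> bool:
    """Return True if the judge LLM deems the summary consistent."""
    answer = call_llm(FC_PROMPT.format(document=document, summary=summary))
    # Map the free-form answer onto a binary consistent/inconsistent label.
    return answer.strip().lower().startswith("yes")
```

Binary predictions of this form can then be compared against expert FC annotations (e.g. on TreatFact) to score each LLM as an evaluator.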