Recent advances in text summarization, particularly with the advent of Large Language Models (LLMs), have yielded remarkable performance. However, a substantial number of automatically generated summaries still exhibit factual inconsistencies, such as hallucinations. In response, various approaches for evaluating the factual consistency of summaries have emerged. Yet these newly introduced metrics face several limitations, including a lack of interpretability, a focus on summaries of short documents (e.g., news articles), and, especially for LLM-based metrics, computational impracticality. To address these shortcomings, we propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE), a more interpretable and efficient factuality-oriented metric. FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary. Our metric sets a new state of the art on AGGREFACT, the de facto benchmark for factuality evaluation. Moreover, we extend our evaluation to a more challenging setting by conducting a human annotation process for long-form summarization.
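To make the claim-to-source alignment idea concrete, the following is a minimal sketch and not the FENICE implementation. It assumes the claims have already been extracted from the summary, and it uses a hypothetical `toy_entailment` function (token overlap) as a stand-in for a real NLI model. The aggregation used here (best-matching source sentence per claim, then the mean over claims) is likewise an illustrative assumption.

```python
from typing import Callable, List, Sequence, Tuple


def toy_entailment(premise: str, hypothesis: str) -> float:
    """Hypothetical stand-in for an NLI model (assumption, not a real scorer).

    Returns the fraction of hypothesis tokens that also occur in the premise.
    """
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return 0.0
    hits = sum(token in premise_tokens for token in hypothesis_tokens)
    return hits / len(hypothesis_tokens)


def score_summary(
    source_sentences: Sequence[str],
    claims: Sequence[str],
    nli: Callable[[str, str], float] = toy_entailment,
) -> Tuple[float, List[Tuple[str, str, float]]]:
    """Align each claim with its best-supporting source sentence.

    Returns the mean of the per-claim best scores together with the
    (claim, aligned source sentence, score) triples, which is what makes
    the result inspectable claim by claim.
    """
    if not source_sentences:
        raise ValueError("at least one source sentence is required")
    alignments: List[Tuple[str, str, float]] = []
    for claim in claims:
        scores = [nli(sentence, claim) for sentence in source_sentences]
        best = max(range(len(scores)), key=scores.__getitem__)
        alignments.append((claim, source_sentences[best], scores[best]))
    if not alignments:
        return 0.0, alignments
    overall = sum(score for _, _, score in alignments) / len(alignments)
    return overall, alignments


if __name__ == "__main__":
    source = ["the cat sat on the mat", "it rained all day"]
    extracted_claims = ["the cat sat", "the dog barked"]
    overall, aligned = score_summary(source, extracted_claims)
    print(overall)
    for claim, sentence, score in aligned:
        print(f"{claim!r} <- {sentence!r} ({score:.2f})")
```

Returning the per-claim alignments alongside the overall score is the design choice that mirrors the interpretability goal: a low score can be traced back to the specific claims that no source sentence supports. In practice, `nli` would be replaced by a trained NLI model.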