We explore the need for more comprehensive and precise evaluation techniques for generative artificial intelligence (GenAI) in text summarization tasks, specifically opinion summarization. Traditional methods, which leverage automated metrics to compare machine-generated summaries against a collection of opinion pieces, e.g., product reviews, have shown limitations due to the paradigm shift introduced by large language models (LLMs). This paper addresses these shortcomings by proposing a novel, fully automated methodology for assessing the factual consistency of such summaries. The method measures the similarity between the claims in a given summary and those in the original reviews, capturing both the coverage and the consistency of the generated summary. To do so, we rely on a simple approach for extracting factual claims from texts, which we then compare and aggregate into a single score. We demonstrate that the proposed metric assigns higher scores to similar claims, regardless of whether a claim is negated, paraphrased, or expanded, and that the score correlates more strongly with human judgment than state-of-the-art metrics.
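The claim-matching idea described above can be sketched in a minimal form. This is a toy illustration only, not the paper's actual method: the function names are hypothetical, and a simple bag-of-words cosine stands in for whatever similarity model the paper uses. Coverage is computed as a recall-like aggregate (each review claim matched to its closest summary claim) and consistency as a precision-like aggregate (each summary claim matched to its closest review claim).

```python
import math
from collections import Counter


def claim_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two claims (toy stand-in
    for a learned similarity model)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


def consistency_score(summary_claims: list[str], review_claims: list[str]) -> float:
    """Precision-like: average best match of each summary claim
    against the review claims."""
    return sum(
        max(claim_similarity(s, r) for r in review_claims) for s in summary_claims
    ) / len(summary_claims)


def coverage_score(summary_claims: list[str], review_claims: list[str]) -> float:
    """Recall-like: average best match of each review claim
    against the summary claims."""
    return sum(
        max(claim_similarity(r, s) for s in summary_claims) for r in review_claims
    ) / len(review_claims)
```

For example, a summary claim that exactly restates one of two review claims yields a consistency of 1.0 but a coverage of only 0.5, since the second review claim is left unmatched.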