In recent years, Large Language Models (LLMs) have attracted considerable attention for emergent capabilities that surpass those of earlier language models. One particularly intriguing application is using LLMs to evaluate text produced by generative models. In this study, we investigate whether LLMs can serve as reliable assessors of factual consistency in summaries generated by text-generation models. First, we introduce a new approach to factuality assessment in which a single LLM carries out the entire question-answering-based factuality scoring process. We then examine how well various LLMs perform at direct factuality scoring, benchmarking them against traditional measures and human annotations. Contrary to initial expectations, our results show no significant correlation between factuality metrics and human evaluations for GPT-4 and PaLM-2. Notable correlations were observed only with GPT-3.5, and only in two factuality subcategories. The consistency of these findings across factual error categories suggests a fundamental limitation in the ability of current LLMs to gauge factuality accurately.
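To make the comparison between factuality metrics and human evaluations concrete, the following is a minimal sketch of a rank correlation between per-summary metric scores and human ratings. It rests on assumptions: the text does not say which correlation coefficient was used, so Spearman's rank correlation is chosen here only as a common option, and the function names `average_ranks`, `pearson`, and `spearman` are hypothetical helpers rather than code from the study. The sketch computes the coefficient only and does not test for statistical significance.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(metric_scores, human_scores):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))
```

Here `metric_scores` would hold one factuality score per summary from a metric or an LLM, and `human_scores` the matching human annotations. A coefficient near zero would correspond to the lack of correlation described above, while a value near 1 would indicate that the metric ranks summaries much as human annotators do.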