After the introduction of Large Language Models (LLMs), there have been substantial improvements in the performance of Natural Language Generation (NLG) tasks, including Text Summarization and Machine Translation. However, LLMs still produce outputs containing hallucinations, that is, content not grounded in factual information. Developing methods to assess the factuality of LLM outputs has therefore become urgent, and resources for factuality evaluation have recently emerged. Although challenging for LLMs, these existing resources face one or more of the following limitations: (i) they are tailored to a specific task or domain; (ii) they are limited in size, preventing the training of new factuality evaluators; (iii) they are designed for simpler verification tasks, such as claim verification. To address these issues, we introduce LLM-Oasis, to the best of our knowledge the largest resource for training end-to-end factuality evaluators. LLM-Oasis is constructed by extracting claims from Wikipedia, falsifying a subset of these claims, and generating pairs of factual and unfactual texts. We then rely on human annotators both to validate the quality of our dataset and to create a gold-standard test set for benchmarking factuality evaluation systems. Our experiments demonstrate that LLM-Oasis presents a significant challenge for state-of-the-art LLMs, with GPT-4o achieving at most 60% accuracy on our proposed end-to-end factuality evaluation task, highlighting the resource's potential to drive future research in the field.