UFO: a Unified and Flexible Framework for Evaluating Factuality of Large Language Models

Large language models (LLMs) may generate text that is inconsistent with human knowledge, leading to factual inaccuracies or "hallucination". Existing research on evaluating the factuality of LLMs involves extracting fact claims using an LLM and verifying them against a predefined fact source. However, these evaluation metrics are task-specific and not scalable, and the substitutability of fact sources across tasks is under-explored. To address these challenges, we categorize four available fact sources (human-written evidence, reference documents, search engine results, and LLM knowledge), along with five text generation tasks containing six representative datasets. We then propose UFO, an LLM-based unified and flexible evaluation framework that verifies facts against plug-and-play fact sources. We implement five evaluation scenarios based on this framework. Experimental results show that for most QA tasks, human-written evidence and reference documents are crucial, and they can substitute for each other in retrieval-augmented QA tasks. In news fact generation tasks, search engine results and LLM knowledge are essential. Our dataset and code are available at https://github.com/WaldenRUC/UFO.
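The idea of verifying extracted claims against interchangeable fact sources can be illustrated with a minimal sketch. This is not the UFO implementation: the `FactSource` interface, the lookup-table sources, the sentence-splitting claim extractor, and the scoring rule (a claim counts as supported only if the first source with a verdict supports it) are all assumptions made for illustration. A real system would use an LLM for both claim extraction and verification.

```python
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical interface: a fact source maps a claim to True (supported),
# False (contradicted), or None (the source has no relevant evidence).
FactSource = Callable[[str], Optional[bool]]


def make_lookup_source(facts: Dict[str, bool]) -> FactSource:
    """Toy stand-in for a real fact source: an exact-match lookup table."""
    def lookup(claim: str) -> Optional[bool]:
        return facts.get(claim.strip().lower())
    return lookup


def extract_claims(text: str) -> List[str]:
    """Toy claim extractor: treats every sentence as one claim."""
    return [s.strip() for s in text.split(".") if s.strip()]


def factuality_score(text: str, sources: Sequence[FactSource]) -> float:
    """Fraction of claims supported by the first source that gives a verdict.

    Sources are tried in the order given, so swapping or reordering them
    changes the evaluation scenario. Claims that no source can judge are
    counted as unsupported (an assumption of this sketch).
    """
    claims = extract_claims(text)
    if not claims:
        return 0.0
    supported = 0
    for claim in claims:
        for source in sources:
            verdict = source(claim)
            if verdict is not None:
                supported += int(verdict)
                break
    return supported / len(claims)


# Two illustrative sources standing in for human-written evidence
# and LLM knowledge.
human_evidence = make_lookup_source({"paris is the capital of france": True})
llm_knowledge = make_lookup_source({
    "the moon is made of cheese": False,
    "water boils at 100 c at sea level": True,
})

answer = (
    "Paris is the capital of France. The moon is made of cheese. "
    "Water boils at 100 C at sea level. Unknown claim"
)
score = factuality_score(answer, [human_evidence, llm_knowledge])
```

Because each source is just a callable, a different evaluation scenario amounts to passing a different list of sources to `factuality_score`.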



