Large language models (LLMs) may generate text that is inconsistent with human knowledge, leading to factual inaccuracies or \textit{hallucination}. Existing research on evaluating the factuality of LLMs typically extracts fact claims with an LLM and verifies them against a predefined fact source. However, these evaluation metrics are task-specific and not scalable, and the substitutability of fact sources across tasks remains under-explored. To address these challenges, we categorize four available fact sources (human-written evidence, reference documents, search engine results, and LLM knowledge) and five text generation tasks covering six representative datasets. We then propose \texttt{UFO}, an LLM-based unified and flexible evaluation framework that verifies facts against plug-and-play fact sources, and implement five evaluation scenarios based on it. Experimental results show that for most QA tasks, human-written evidence and reference documents are crucial, and they can substitute for each other in retrieval-augmented QA tasks. In news fact generation tasks, search engine results and LLM knowledge are essential. Our dataset and code are available at \url{https://github.com/WaldenRUC/UFO}.
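The extract-then-verify pipeline with plug-and-play fact sources described above can be illustrated with a minimal sketch. This is not the \texttt{UFO} implementation: every name here (`FactSource`, `dict_source`, `extract_claims`, `verify`, `factuality_score`) is hypothetical. Sentence splitting and substring matching stand in for the LLM-based extraction and verification steps, and the rule of stopping at the first source that returns evidence is an assumption made for illustration.

```python
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical interface: a fact source maps a claim to evidence text,
# or returns None when it has nothing relevant.
FactSource = Callable[[str], Optional[str]]


def dict_source(store: Dict[str, str]) -> FactSource:
    """Build a toy fact source backed by a keyword -> evidence dictionary."""
    def lookup(claim: str) -> Optional[str]:
        for keyword, evidence in store.items():
            if keyword.lower() in claim.lower():
                return evidence
        return None
    return lookup


def extract_claims(text: str) -> List[str]:
    """Stand-in for LLM-based claim extraction: one claim per sentence."""
    return [s.strip() for s in text.split(".") if s.strip()]


def verify(claim: str, evidence: str) -> bool:
    """Stand-in for LLM-based verification: case-insensitive containment."""
    return claim.lower() in evidence.lower()


def factuality_score(text: str, sources: Sequence[FactSource]) -> float:
    """Fraction of claims supported by the first source that returns evidence.

    Assumption: sources are tried in order and the first one with evidence
    decides the verdict; claims with no evidence count as unsupported.
    """
    claims = extract_claims(text)
    if not claims:
        return 0.0
    supported = 0
    for claim in claims:
        for source in sources:
            evidence = source(claim)
            if evidence is not None:
                if verify(claim, evidence):
                    supported += 1
                break
    return supported / len(claims)


# Toy fact sources standing in for human-written evidence and reference documents.
human_evidence = dict_source({"paris": "Paris is the capital of France."})
reference_docs = dict_source({"berlin": "Berlin is the capital of Germany."})

text = "Paris is the capital of France. Berlin is the capital of Spain."
score = factuality_score(text, [human_evidence, reference_docs])
print(score)  # 0.5
```

Because the scorer only depends on the callable interface, swapping or reordering the list of sources changes the evaluation scenario without touching the extraction or verification code.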