Large language models (LLMs), such as ChatGPT, are prone to generating hallucinations, \ie content that conflicts with the source or cannot be verified against factual knowledge. To understand what types of content LLMs tend to hallucinate, and to what extent, we introduce the Hallucination Evaluation for Large Language Models (HaluEval) benchmark, a large collection of generated and human-annotated hallucinated samples for evaluating the performance of LLMs in recognizing hallucination. To generate these samples, we propose a ChatGPT-based two-step framework, \ie sampling-then-filtering. In addition, we hire human labelers to annotate the hallucinations in ChatGPT responses. The empirical results suggest that ChatGPT is likely to generate hallucinated content on specific topics by fabricating unverifiable information (\ie about $11.4\%$ of user queries). Moreover, existing LLMs face great challenges in recognizing hallucinations in text. However, our experiments also show that hallucination recognition can be improved by providing external knowledge or adding reasoning steps. Our benchmark can be accessed at https://github.com/RUCAIBox/HaluEval.