Hallucinations pose a significant challenge to the reliability of large language models (LLMs) in critical domains. Recent benchmarks designed to assess LLM hallucinations within conventional NLP tasks, such as knowledge-intensive question answering (QA) and summarization, are insufficient for capturing the complexities of user-LLM interactions in dynamic, real-world settings. To address this gap, we introduce HaluEval-Wild, the first benchmark specifically designed to evaluate LLM hallucinations in the wild. We meticulously collect challenging user queries (adversarially filtered with Alpaca) from ShareGPT, an existing real-world user-LLM interaction dataset, to evaluate the hallucination rates of various LLMs. Upon analyzing the collected queries, we categorize them into five distinct types, which enables a fine-grained analysis of the kinds of hallucinations LLMs exhibit, and we synthesize reference answers with the powerful GPT-4 model and retrieval-augmented generation (RAG). Our benchmark offers a novel approach to understanding and improving LLM reliability in scenarios reflective of real-world interactions. Our benchmark is available at https://github.com/HaluEval-Wild/HaluEval-Wild.
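As a rough illustration of the adversarial filtering step described above, the sketch below keeps only those ShareGPT queries on which a weaker model's response is judged hallucinated. This is a minimal sketch under stated assumptions, not the paper's actual pipeline: the `generate` and `judge` callables are hypothetical placeholders standing in for, e.g., an Alpaca inference call and a hallucination judge.

```python
# Minimal sketch of adversarial filtering: keep queries that trip up a
# weaker model. `generate` and `judge` are hypothetical placeholders,
# not functions from the HaluEval-Wild codebase.
from typing import Callable, Iterable


def adversarially_filter(
    queries: Iterable[str],
    generate: Callable[[str], str],   # e.g., an Alpaca completion call
    judge: Callable[[str, str], bool],  # e.g., a judge that flags hallucinations
) -> list[str]:
    """Return the subset of queries whose response is judged hallucinated."""
    challenging = []
    for query in queries:
        response = generate(query)
        if judge(query, response):  # hallucinated -> query is "challenging"
            challenging.append(query)
    return challenging
```

Filtering by a weaker model's failures is a common way to concentrate a benchmark on hard cases; the judged-hallucinated queries would then be categorized and paired with reference answers synthesized via GPT-4 and RAG.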