Hallucinations pose a significant challenge to the reliability of large language models (LLMs) in critical domains. Recent benchmarks that assess LLM hallucinations within conventional NLP tasks, such as knowledge-intensive question answering (QA) and summarization, are insufficient for capturing the complexities of user-LLM interactions in dynamic, real-world settings. To address this gap, we introduce HaluEval-Wild, the first benchmark specifically designed to evaluate LLM hallucinations in the wild. We collect challenging user queries, adversarially filtered using Alpaca, from existing real-world user-LLM interaction datasets, including ShareGPT, and use them to evaluate the hallucination rates of various LLMs. We categorize the collected queries into five distinct types, which enables a fine-grained analysis of the types of hallucinations LLMs exhibit, and we synthesize reference answers using GPT-4 and retrieval-augmented generation (RAG). Our benchmark offers a novel approach to understanding and improving LLM reliability in scenarios that reflect real-world interactions. The benchmark is available at https://github.com/Dianezzy/HaluEval-Wild.