Hallucinations pose a significant challenge to the reliability of large language models (LLMs) in critical domains. Recent benchmarks that assess LLM hallucinations within conventional NLP tasks, such as knowledge-intensive question answering (QA) and summarization, are insufficient for capturing the complexities of user-LLM interactions in dynamic, real-world settings. To address this gap, we introduce HaluEval-Wild, the first benchmark specifically designed to evaluate LLM hallucinations in the wild. We collect challenging user queries (adversarially filtered using Alpaca) from existing real-world user-LLM interaction datasets, including ShareGPT, and use them to evaluate the hallucination rates of various LLMs. We categorize the collected queries into five distinct types, which enables a fine-grained analysis of the hallucinations LLMs exhibit, and we synthesize reference answers with GPT-4 and retrieval-augmented generation (RAG). The benchmark offers a new way to understand and improve LLM reliability in scenarios that reflect real-world interactions.
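
As an illustration of the fine-grained analysis described above, the following minimal sketch shows one way per-type hallucination rates could be tabulated once each model response has been judged against its reference answer. It is not the benchmark's actual evaluation code: the function name `hallucination_rates`, the record layout, and the type labels `type_a`, `type_b`, and `type_c` are hypothetical placeholders, and the judging step itself (deciding whether a response is hallucinated) is assumed to have been done elsewhere.

```python
from collections import defaultdict


def hallucination_rates(records):
    """Return the fraction of hallucinated responses per query type.

    `records` is an iterable of (query_type, is_hallucinated) pairs.
    Both the layout and the labels are assumptions made for this sketch.
    """
    totals = defaultdict(int)
    hallucinated = defaultdict(int)
    for query_type, is_hallucinated in records:
        totals[query_type] += 1
        if is_hallucinated:
            hallucinated[query_type] += 1
    # Every key in `totals` has a count of at least 1, so no division by zero.
    return {t: hallucinated[t] / totals[t] for t in totals}


# Hypothetical judged responses; the type labels are placeholders,
# not the benchmark's real category names.
records = [
    ("type_a", True),
    ("type_a", False),
    ("type_b", False),
    ("type_b", False),
    ("type_c", True),
]
rates = hallucination_rates(records)
for query_type, rate in sorted(rates.items()):
    print(f"{query_type}: {rate:.2f}")
```

Reporting a rate per query type, rather than a single aggregate number, is what allows the comparison across types that the abstract refers to.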