While hallucinations of large language models (LLMs) remain a major challenge, existing factuality evaluation benchmarks do not cover the diverse domains of knowledge that real-world users of LLMs seek information about. To bridge this gap, we introduce WildHallucinations, a benchmark that evaluates factuality by prompting LLMs to generate information about entities mined from real user-chatbot conversations. These generations are then automatically fact-checked against a systematically curated knowledge source collected via web search. Notably, half of these real-world entities have no associated Wikipedia page. We evaluate 118,785 generations from 15 LLMs on 7,919 entities. We find that LLMs consistently hallucinate more on entities without Wikipedia pages and exhibit varying hallucination rates across domains. Finally, given the same base model, adding a retrieval component only slightly reduces hallucinations and does not eliminate them.
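The evaluation loop sketched in the abstract — generate text about each mined entity, decompose it into atomic claims, and check each claim against the curated web knowledge source — can be summarized as follows. This is a minimal illustrative sketch, not the paper's released implementation: every name here (`Entity`, `generate_about_entity`, `extract_claims`, `claim_supported_by`) is a hypothetical placeholder, and a real pipeline would call an LLM API and an automatic fact-checker at the marked stubs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entity:
    name: str                 # entity mined from user-chatbot conversations
    knowledge: List[str]      # curated web-search documents about the entity
    has_wikipedia: bool       # whether the entity has a Wikipedia page


def generate_about_entity(entity: Entity) -> str:
    """Placeholder for prompting an LLM, e.g. 'Tell me about <entity>'."""
    return f"{entity.name} is ..."  # stub generation


def extract_claims(generation: str) -> List[str]:
    """Placeholder for decomposing a generation into atomic factual claims."""
    return [s.strip() for s in generation.split(".") if s.strip()]


def claim_supported_by(claim: str, knowledge: List[str]) -> bool:
    """Placeholder fact-check: is the claim supported by the curated documents?"""
    return any(claim.lower() in doc.lower() for doc in knowledge)


def hallucination_rate(entities: List[Entity]) -> float:
    """Fraction of generated claims not supported by the entity's knowledge source."""
    unsupported, total = 0, 0
    for entity in entities:
        for claim in extract_claims(generate_about_entity(entity)):
            total += 1
            if not claim_supported_by(claim, entity.knowledge):
                unsupported += 1
    return unsupported / total if total else 0.0
```

Under this framing, the abstract's per-domain and with/without-Wikipedia comparisons amount to computing `hallucination_rate` over the corresponding subsets of entities.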