Large Language Models (LLMs) possess a remarkable capacity to generate persuasive and intelligible language. However, coherence does not equate to truthfulness: LLM responses often contain subtle hallucinations. Existing benchmarks rely on static, narrow question sets, leading to limited coverage and misleading evaluations. We present KGHaluBench, a Knowledge Graph-based hallucination benchmark that assesses LLMs across the breadth and depth of their knowledge, providing a fairer and more comprehensive view of LLM truthfulness. Our framework utilises the KG to dynamically construct challenging, multifaceted questions, whose difficulty is then statistically estimated to address popularity bias. Our automated verification pipeline detects abstentions and verifies each response at both the conceptual and the correctness level, identifying different types of hallucination. We evaluate 25 frontier models using novel accuracy and hallucination metrics. The results offer interpretable insight into the knowledge factors that drive hallucination across model sizes. KGHaluBench is publicly available to support future work on hallucination mitigation.