Large language models (LLMs) have transformed the landscape of language processing, yet they still face significant challenges in security, privacy, and the generation of seemingly coherent but factually inaccurate outputs, commonly referred to as hallucinations. One particularly pressing issue is Fact-Conflicting Hallucination (FCH), where an LLM generates content that directly contradicts established facts. Tackling FCH is difficult because of two primary obstacles. First, automating the construction and updating of benchmark datasets is challenging, as current methods rely on static benchmarks that do not cover the diverse range of FCH scenarios. Second, validating the reasoning process behind LLM outputs is inherently complex, especially when intricate logical relations are involved. To address these obstacles, we propose an approach that leverages logic programming to enhance metamorphic testing for FCH detection. Our method gathers data from sources such as Wikipedia, expands it through logical reasoning to create diverse test cases, queries LLMs with structured prompts, and validates the coherence of their responses using semantic-aware assessment mechanisms. We use the method to generate test cases and detect hallucinations in six different LLMs across nine domains, revealing hallucination rates ranging from 24.7% to 59.8%. Key observations indicate that LLMs struggle particularly with temporal concepts and out-of-distribution knowledge, and that they exhibit deficiencies in logical reasoning. The results demonstrate that the logic-based test cases generated by our tool are effective at both triggering and identifying hallucinations. These findings underscore the need for ongoing collaborative efforts within the community to detect and address LLM hallucinations.
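To make the test-generation idea concrete, the following is a minimal sketch of how seed facts might be expanded by a logical rule into question and expected-answer pairs. It is not the tool described above: the `Fact` type, the seed facts, the single transitivity rule, the assumption that `located_in` is asymmetric, and the string-based `conflicts_with_fact` check (a simple stand-in for a semantic-aware assessment) are all illustrative assumptions.

```python
from itertools import product
from typing import NamedTuple


class Fact(NamedTuple):
    subject: str
    relation: str
    obj: str


# Hypothetical seed facts; a real pipeline would extract them from a knowledge source.
SEED_FACTS = {
    Fact("the Eiffel Tower", "located_in", "Paris"),
    Fact("Paris", "located_in", "France"),
    Fact("France", "located_in", "Europe"),
}

# Assumption: only this relation is treated as transitive in the sketch.
TRANSITIVE_RELATIONS = {"located_in"}


def transitive_closure(facts):
    """Apply the rule r(a, b) and r(b, c) => r(a, c) until no new fact appears."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for f, g in product(list(known), repeat=2):
            if (
                f.relation == g.relation
                and f.relation in TRANSITIVE_RELATIONS
                and f.obj == g.subject
            ):
                derived = Fact(f.subject, f.relation, g.obj)
                if derived not in known:
                    known.add(derived)
                    changed = True
    return known


def make_test_cases(seed_facts):
    """Build yes/no questions from derived (non-seed) facts.

    Each derived fact yields one question expected to be answered "yes" and,
    assuming located_in is asymmetric, one swapped question expected to be "no".
    """
    derived = transitive_closure(seed_facts) - set(seed_facts)
    cases = []
    for fact in sorted(derived):
        cases.append(
            (f"Is it true that {fact.subject} is located in {fact.obj}? Answer yes or no.", "yes")
        )
        cases.append(
            (f"Is it true that {fact.obj} is located in {fact.subject}? Answer yes or no.", "no")
        )
    return cases


def conflicts_with_fact(answer, expected):
    """Crude stand-in for semantic checking: flag answers that do not start with the expected word."""
    return not answer.strip().lower().startswith(expected)
```

In use, each question returned by `make_test_cases(SEED_FACTS)` would be sent to the model under test, and `conflicts_with_fact` would be applied to the reply. A realistic checker would also need to examine the model's stated reasoning, which this sketch does not attempt.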