HalluVault: A Novel Logic Programming-aided Metamorphic Testing Framework for Detecting Fact-Conflicting Hallucinations in Large Language Models

Large language models (LLMs) have transformed language processing, yet they still face significant challenges in security, privacy, and the generation of seemingly coherent but factually inaccurate outputs, commonly referred to as hallucinations. One particularly pressing issue is Fact-Conflicting Hallucination (FCH), where an LLM generates content that directly contradicts established facts. Tackling FCH is difficult for two reasons. First, automating the construction and updating of benchmark datasets is challenging, because current methods rely on static benchmarks that do not cover the diverse range of FCH scenarios. Second, validating the reasoning process behind LLM outputs is inherently complex, especially when intricate logical relations are involved. To address these obstacles, we propose an approach that uses logic programming to enhance metamorphic testing for FCH detection. Our method gathers data from sources such as Wikipedia, expands it with logical reasoning to create diverse test cases, queries LLMs through structured prompts, and validates the coherence of their answers using semantic-aware assessment mechanisms. We generate test cases and detect hallucinations across six different LLMs spanning nine domains, revealing hallucination rates ranging from 24.7% to 59.8%. Key observations indicate that LLMs struggle particularly with temporal concepts and out-of-distribution knowledge, and that they exhibit deficiencies in logical reasoning. The results show that the logic-based test cases generated by our tool are effective at both triggering and identifying hallucinations, and they underscore the need for ongoing collaborative efforts within the community to detect and address LLM hallucinations.
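The pipeline described in the abstract (seed facts, logical expansion, structured prompts, answer checking) can be illustrated with a small sketch. This is not the HalluVault implementation: the seed triples, the `located_in` relation, the single transitivity rule, the prompt wording, and the naive yes/no check are all assumptions chosen for illustration, and the paper's semantic-aware assessment is likely considerably more involved than the string comparison used here.

```python
from itertools import product

# Hypothetical seed facts as (subject, relation, object) triples,
# standing in for facts extracted from a source such as Wikipedia.
SEED_FACTS = {
    ("Paris", "located_in", "France"),
    ("France", "located_in", "Europe"),
    ("Louvre", "located_in", "Paris"),
}


def transitive_closure(facts, relation):
    """Apply the rule r(a, b) and r(b, c) => r(a, c) until no new fact appears."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        same_rel = [f for f in derived if f[1] == relation]
        for (a, _, b), (c, _, d) in product(same_rel, repeat=2):
            new_fact = (a, relation, d)
            if b == c and a != d and new_fact not in derived:
                derived.add(new_fact)
                changed = True
    return derived


def make_test_cases(seed, relation):
    """Turn each derived (non-seed) fact into a yes/no prompt with an expected answer."""
    derived = transitive_closure(seed, relation) - set(seed)
    cases = []
    for subj, rel, obj in sorted(derived):
        question = (
            f"Is it true that {subj} is {rel.replace('_', ' ')} {obj}? "
            "Answer yes or no."
        )
        cases.append({"question": question, "expected": "yes"})
    return cases


def is_hallucination(case, llm_answer):
    """Naive stand-in for answer validation: flag answers that do not start
    with the expected label. Real validation would compare meaning, not prefixes."""
    return not llm_answer.strip().lower().startswith(case["expected"])


if __name__ == "__main__":
    for case in make_test_cases(SEED_FACTS, "located_in"):
        print(case["question"])
```

Each derived fact is one the seed data never states directly, so a model must combine two known facts to answer correctly. That is the sense in which logic-derived cases probe reasoning rather than recall.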

