Large Language Models (LLMs), such as ChatGPT/GPT-4, have garnered widespread attention owing to their many practical applications, yet their adoption has been constrained by fact-conflicting hallucinations across web platforms. The assessment of factuality in LLM-generated text remains inadequately explored; it covers not only the judgment of vanilla facts but also the evaluation of factual errors that emerge in complex inferential tasks such as multi-hop reasoning. In response, we introduce FactCHD, a fact-conflicting hallucination detection benchmark designed for LLMs. Serving as a tool for evaluating factuality in "Query-Response" contexts, our benchmark comprises a large-scale dataset covering a broad spectrum of factuality patterns: vanilla, multi-hop, comparison, and set-operation. A distinctive feature of our benchmark is its incorporation of fact-based chains of evidence, which support comprehensive factual reasoning throughout the assessment process. We evaluate multiple LLMs, demonstrating the effectiveness of the benchmark and showing that current methods fall short of faithfully detecting factual errors. Furthermore, we present TRUTH-TRIANGULATOR, which synthesizes reflective considerations from tool-enhanced ChatGPT and a LoRA-tuned model based on Llama2, aiming to yield more credible detection by combining predictive results and evidence. The benchmark dataset and source code will be made available at https://github.com/zjunlp/FactCHD.