Knowledge Graphs (KGs) store structured factual knowledge by linking entities through relationships and are crucial for many applications. These applications depend on the KG's factual accuracy, so verifying facts is essential, yet challenging. Manual verification by experts is ideal but impractical at scale. Automated methods show promise but are not ready for real-world KGs. Large Language Models (LLMs) offer potential through their semantic understanding and broad knowledge, yet their suitability and effectiveness for KG fact validation remain largely unexplored. In this paper, we introduce FactCheck, a benchmark designed to evaluate LLMs for KG fact validation along three key dimensions: (1) the LLMs' internal knowledge; (2) external evidence via Retrieval-Augmented Generation (RAG); and (3) aggregated knowledge via a multi-model consensus strategy. We evaluate open-source and commercial LLMs on three diverse real-world KGs. FactCheck also includes a RAG dataset of more than 2 million documents tailored for KG fact validation, as well as an interactive platform for exploring verification decisions. Our experimental analysis shows that while LLMs yield promising results, they are not yet sufficiently stable and reliable for real-world KG validation. Integrating external evidence through RAG yields fluctuating performance, providing inconsistent improvements over simpler approaches while incurring higher computational costs. Similarly, multi-model consensus strategies do not consistently outperform individual models, underscoring the lack of a one-size-fits-all solution. These findings further emphasize the need for a benchmark like FactCheck to systematically evaluate and drive progress on this difficult yet crucial task.
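As a rough illustration of the third dimension, the sketch below shows one plausible majority-vote consensus over per-model verdicts on a single KG triple. The `ask_model` helper and the model names are hypothetical placeholders, not FactCheck's actual API; this is a minimal sketch under those assumptions, not the paper's implementation.

```python
# Minimal sketch of a multi-model consensus strategy for KG fact
# validation. ask_model() and the model names below are hypothetical
# placeholders; a real setup would wrap actual LLM calls.
from collections import Counter

Triple = tuple[str, str, str]  # (subject, predicate, object)

def ask_model(model: str, triple: Triple) -> bool:
    """Hypothetical wrapper: prompt `model` to judge whether the
    given triple is factually correct, returning True or False."""
    raise NotImplementedError  # stands in for an actual LLM call

def consensus_verdict(models: list[str], triple: Triple) -> bool:
    """Majority vote over the individual models' True/False verdicts."""
    votes = Counter(ask_model(m, triple) for m in models)
    return votes[True] > votes[False]

# Example usage with three hypothetical models:
triple = ("Marie Curie", "awardReceived", "Nobel Prize in Physics")
# verdict = consensus_verdict(["model-a", "model-b", "model-c"], triple)
```

A simple majority vote like this is only one aggregation choice; weighted or confidence-based schemes are equally plausible, and, as the results above note, none of them consistently beats the best individual model.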