Retrieval from graph data is crucial for augmenting large language models (LLMs) with both open-domain knowledge and private enterprise data, and it is also a key component of the recent GraphRAG system (Edge et al., 2024). Despite decades of research on knowledge graphs and knowledge base question answering, leading LLM frameworks (e.g., LangChain and LlamaIndex) have only minimal support for retrieval from modern encyclopedic knowledge graphs like Wikidata. In this paper, we analyze the root cause and suggest that modern RDF knowledge graphs (e.g., Wikidata, Freebase) are less efficient for LLMs due to overly large schemas that far exceed the typical LLM context window, the use of resource identifiers, overlapping relation types, and a lack of normalization. As a solution, we propose property graph views on top of the underlying RDF graph that can be efficiently queried by LLMs using Cypher. We instantiate this idea on Wikidata and introduce CypherBench, the first benchmark with 11 large-scale, multi-domain property graphs with 7.8 million entities and over 10,000 questions. To achieve this, we tackle several key challenges, including developing an RDF-to-property-graph conversion engine, creating a systematic pipeline for text-to-Cypher task generation, and designing new evaluation metrics.
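To make the idea of a property graph view concrete, the following is a minimal sketch, not the conversion engine described in the paper. It assumes a toy set of RDF-style triples with made-up identifiers, plus a hypothetical view definition (`NODE_LABELS`, `REL_TYPES`) that maps a few class and property identifiers to readable node labels and relationship types. Everything outside that small schema is dropped, which is what keeps the resulting schema short enough to fit in an LLM prompt.

```python
# Toy RDF-style triples. All identifiers below are made up for illustration
# and are not guaranteed to match real Wikidata IDs.
TRIPLES = [
    ("Q1001", "P31", "Q9001"),  # Q1001 is an instance of class Q9001
    ("Q1002", "P31", "Q9002"),  # Q1002 is an instance of class Q9002
    ("Q1001", "P19", "Q1002"),  # Q1001 has place of birth Q1002
    ("Q1001", "P999", "Q7777"), # a relation outside the view; it is ignored
]

# Hypothetical human-readable names for the entities.
ENTITY_NAMES = {"Q1001": "Douglas Adams", "Q1002": "Cambridge"}

# Hypothetical view definition: a small, readable schema.
INSTANCE_OF = "P31"
NODE_LABELS = {"Q9001": "Person", "Q9002": "City"}
REL_TYPES = {"P19": "bornIn"}


def build_view(triples):
    """Return (nodes, edges) restricted to the view's schema."""
    nodes = {}
    for s, p, o in triples:
        if p == INSTANCE_OF and o in NODE_LABELS:
            nodes[s] = {"label": NODE_LABELS[o], "name": ENTITY_NAMES[s]}
    edges = [
        (s, REL_TYPES[p], o)
        for s, p, o in triples
        if p in REL_TYPES and s in nodes and o in nodes
    ]
    return nodes, edges


def to_cypher(nodes, edges):
    """Emit Cypher statements that would load the view into a graph database."""
    stmts = [
        f'MERGE (:{n["label"]} {{name: "{n["name"]}"}})' for n in nodes.values()
    ]
    for s, rel, o in edges:
        a, b = nodes[s], nodes[o]
        stmts.append(
            f'MATCH (a:{a["label"]} {{name: "{a["name"]}"}}), '
            f'(b:{b["label"]} {{name: "{b["name"]}"}}) '
            f"MERGE (a)-[:{rel}]->(b)"
        )
    return stmts


if __name__ == "__main__":
    view_nodes, view_edges = build_view(TRIPLES)
    for stmt in to_cypher(view_nodes, view_edges):
        print(stmt)
```

Against such a view, a question like "Who was born in Cambridge?" could be expressed with readable labels rather than opaque identifiers, for example `MATCH (p:Person)-[:bornIn]->(:City {name: "Cambridge"}) RETURN p.name`.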