The use of retrieval-augmented generation (RAG) to retrieve relevant information from an external knowledge source enables large language models (LLMs) to answer questions over private and/or previously unseen document collections. However, RAG fails on global questions directed at an entire text corpus, such as "What are the main themes in the dataset?", since this is inherently a query-focused summarization (QFS) task, rather than an explicit retrieval task. Prior QFS methods, meanwhile, fail to scale to the quantities of text indexed by typical RAG systems. To combine the strengths of these contrasting methods, we propose a Graph RAG approach to question answering over private text corpora that scales with both the generality of user questions and the quantity of source text to be indexed. Our approach uses an LLM to build a graph-based text index in two stages: first to derive an entity knowledge graph from the source documents, then to pregenerate community summaries for all groups of closely-related entities. Given a question, each community summary is used to generate a partial response, before all partial responses are again summarized in a final response to the user. For a class of global sensemaking questions over datasets in the 1 million token range, we show that Graph RAG leads to substantial improvements over a naïve RAG baseline for both the comprehensiveness and diversity of generated answers. An open-source, Python-based implementation of both global and local Graph RAG approaches is forthcoming at https://aka.ms/graphrag.
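The query-time flow described above (partial responses per community summary, then a final summarization) can be sketched as a simple map-reduce. This is a minimal illustration, not the paper's implementation: `call_llm` is a hypothetical placeholder for any LLM completion call, and the prompts are illustrative.

```python
# Minimal map-reduce sketch of Graph RAG's query phase, assuming
# community summaries have already been pregenerated at index time.

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: in practice this would call an LLM.
    return f"[answer derived from: {prompt[:40]}...]"

def answer_global_question(question: str, community_summaries: list[str]) -> str:
    # Map step: each pregenerated community summary yields a partial answer.
    partial_answers = [
        call_llm(
            f"Using this community summary:\n{summary}\n"
            f"Answer the question: {question}"
        )
        for summary in community_summaries
    ]
    # Reduce step: summarize all partial answers into one final response.
    combined = "\n".join(partial_answers)
    return call_llm(
        f"Combine these partial answers into a final answer to "
        f"'{question}':\n{combined}"
    )

summaries = [
    "Community A: entities and claims about topic X...",
    "Community B: entities and claims about topic Y...",
]
final = answer_global_question("What are the main themes?", summaries)
```

Because the map step is independent per community, partial responses can be generated in parallel, which is what lets the approach scale with corpus size.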