Knowledge graphs (KGs) generated by large language models (LLMs) are becoming increasingly valuable for Retrieval-Augmented Generation (RAG) applications that require knowledge-intensive reasoning. However, existing KG extraction methods rely predominantly on prompt-based approaches, which are inefficient for processing large-scale corpora. Because they lack a design specialized for KG construction, these approaches also often suffer from information loss, particularly on long documents. Additionally, there is a gap in evaluation datasets and methodologies for ontology-free KG construction. To overcome these limitations, we propose SynthKG, a multi-step, document-level, ontology-free KG synthesis workflow based on LLMs. By fine-tuning a smaller LLM on the synthesized document-KG pairs, we streamline the multi-step process into a single-step KG generation approach called Distill-SynthKG, substantially reducing the number of LLM inference calls. Furthermore, we re-purpose existing question-answering datasets to establish KG evaluation datasets and introduce new evaluation metrics. Using KGs produced by Distill-SynthKG, we also design a novel graph-based retrieval framework for RAG. Experimental results demonstrate that Distill-SynthKG not only surpasses all baseline models in KG quality -- including models up to eight times larger -- but also consistently excels in retrieval and question-answering tasks. Our proposed graph retrieval framework likewise outperforms all KG-based retrieval methods across multiple benchmark datasets. We release the SynthKG dataset and the Distill-SynthKG model publicly to support further research and development.