Retrieval-Augmented Generation (RAG) improves large language models (LLMs) by retrieving relevant information from external sources and has been widely adopted for text-based tasks. For structured data, such as knowledge graphs, Graph Retrieval-Augmented Generation (GraphRAG) retrieves and aggregates information along graph structures. More recently, GraphRAG has been extended to general text settings by organizing unstructured text into graph representations, showing promise for reasoning and grounding. Despite these advances, existing GraphRAG systems for text data are often tailored to specific tasks, datasets, and system designs, resulting in heterogeneous evaluation protocols. Consequently, a systematic understanding of the relative strengths, limitations, and trade-offs between RAG and GraphRAG on widely used text benchmarks remains limited. In this paper, we present a comprehensive benchmark study comparing RAG and GraphRAG on established text-based tasks, including question answering and query-based summarization. We introduce a unified evaluation protocol that standardizes data preprocessing, retrieval configurations, and generation settings, enabling fair and reproducible comparisons. Our results highlight the distinct strengths of RAG and GraphRAG across different tasks and evaluation perspectives. Building on these findings, we explore selection and integration strategies that combine the strengths of both paradigms, leading to consistent performance improvements. We further analyze failure modes, efficiency trade-offs, and evaluation biases, and highlight key considerations for designing and evaluating retrieval-augmented generation systems.