Retrieval-Augmented Generation (RAG) is widely used to mitigate hallucinations of Large Language Models (LLMs) by leveraging external knowledge. While effective for simple queries, traditional RAG systems struggle with large-scale, unstructured corpora where information is fragmented. Recent advances incorporate knowledge graphs to capture relational structures, enabling more comprehensive retrieval for complex, multi-hop reasoning tasks. However, existing graph-based RAG (GraphRAG) methods rely on unstable and costly relation extraction for graph construction, often producing noisy graphs with incorrect or inconsistent relations that degrade retrieval quality. In this paper, we revisit the pipeline of existing GraphRAG systems and propose LinearRAG (Linear Graph-based Retrieval-Augmented Generation), an efficient framework that enables reliable graph construction and precise passage retrieval. Specifically, LinearRAG constructs a relation-free hierarchical graph, termed Tri-Graph, using only lightweight entity extraction and semantic linking, avoiding unstable relation modeling. This new paradigm of graph construction scales linearly with corpus size and incurs no extra token consumption, providing an economical and reliable indexing of the original passages. For retrieval, LinearRAG adopts a two-stage strategy: (i) relevant entity activation via local semantic bridging, followed by (ii) passage retrieval through global importance aggregation. Extensive experiments on four datasets demonstrate that LinearRAG significantly outperforms baseline models. Our code and datasets are available at https://github.com/DEEP-PolyU/LinearRAG.
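To make the relation-free indexing and two-stage retrieval described above concrete, here is a minimal, illustrative Python sketch. The function names (`embed`, `extract_entities`, `build_index`, `retrieve`), the stand-in hashed bag-of-words embedding, and the simplified entity-passage linking are assumptions for illustration only, not the authors' implementation or the exact Tri-Graph structure.

```python
# Minimal sketch in the spirit of LinearRAG: relation-free entity-passage indexing
# followed by (i) entity activation and (ii) passage-level score aggregation.
# All names and heuristics here are illustrative assumptions, not the paper's code.
from collections import defaultdict
import re

import numpy as np


def embed(text: str, dim: int = 256) -> np.ndarray:
    """Stand-in embedding: hashed bag-of-words (replace with a real encoder)."""
    vec = np.zeros(dim)
    for tok in re.findall(r"\w+", text.lower()):
        vec[hash(tok) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec


def extract_entities(passage: str) -> set[str]:
    """Lightweight entity extraction stand-in: capitalized word spans."""
    return set(re.findall(r"\b[A-Z][a-zA-Z]+(?:\s+[A-Z][a-zA-Z]+)*", passage))


def build_index(passages: list[str]):
    """Relation-free indexing: link each entity to the passages mentioning it."""
    entity_to_passages = defaultdict(set)
    for pid, passage in enumerate(passages):
        for ent in extract_entities(passage):
            entity_to_passages[ent].add(pid)
    entity_vecs = {ent: embed(ent) for ent in entity_to_passages}
    return entity_to_passages, entity_vecs


def retrieve(query: str, passages: list[str], index, top_k: int = 2) -> list[str]:
    """Two-stage retrieval: activate query-relevant entities, then aggregate
    their activation scores onto the passages they index and rank passages."""
    entity_to_passages, entity_vecs = index
    q_vec = embed(query)
    # Stage 1: entity activation via query-entity semantic similarity.
    activation = {ent: float(q_vec @ vec) for ent, vec in entity_vecs.items()}
    # Stage 2: passage scoring by aggregating activated entities.
    passage_scores = defaultdict(float)
    for ent, score in activation.items():
        for pid in entity_to_passages[ent]:
            passage_scores[pid] += max(score, 0.0)
    ranked = sorted(passage_scores, key=passage_scores.get, reverse=True)
    return [passages[pid] for pid in ranked[:top_k]]


if __name__ == "__main__":
    corpus = [
        "Marie Curie won the Nobel Prize in Physics in 1903.",
        "Pierre Curie collaborated with Marie Curie on radioactivity research.",
        "The Eiffel Tower is located in Paris.",
    ]
    idx = build_index(corpus)
    print(retrieve("Which prize did Marie Curie win?", corpus, idx))
```

Because indexing only requires one pass of entity extraction and embedding per passage, the cost of this kind of construction grows linearly with corpus size and uses no LLM calls, which is the economy the abstract highlights.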