LinearRAG: Linear Graph Retrieval Augmented Generation on Large-scale Corpora

Retrieval-Augmented Generation (RAG) is widely used to mitigate hallucinations of Large Language Models (LLMs) by leveraging external knowledge. While effective for simple queries, traditional RAG systems struggle with large-scale, unstructured corpora where information is fragmented. Recent advances incorporate knowledge graphs to capture relational structures, enabling more comprehensive retrieval for complex, multi-hop reasoning tasks. However, existing graph-based RAG (GraphRAG) methods rely on unstable and costly relation extraction for graph construction, often producing noisy graphs with incorrect or inconsistent relations that degrade retrieval quality. In this paper, we revisit the pipeline of existing GraphRAG systems and propose LinearRAG (Linear Graph-based Retrieval-Augmented Generation), an efficient framework that enables reliable graph construction and precise passage retrieval. Specifically, LinearRAG constructs a relation-free hierarchical graph, termed Tri-Graph, using only lightweight entity extraction and semantic linking, avoiding unstable relation modeling. This new paradigm of graph construction scales linearly with corpus size and incurs no extra token consumption, providing an economical and reliable indexing of the original passages. For retrieval, LinearRAG adopts a two-stage strategy: (i) relevant entity activation via local semantic bridging, followed by (ii) passage retrieval through global importance aggregation. Extensive experiments on four datasets demonstrate that LinearRAG significantly outperforms baseline models.
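To make the indexing and retrieval ideas in the abstract more concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes the Tri-Graph has three node levels (entities, sentences, passages) linked only by containment, uses a toy capitalized-token extractor in place of a real lightweight entity extractor, and replaces the paper's semantic bridging and importance aggregation with a simple decayed spread over co-occurring entities followed by a per-passage sum. All function names and the `decay` and `hops` parameters are hypothetical.

```python
import re
from collections import defaultdict

# Toy stoplist so that capitalized non-entities are ignored (assumption).
STOP = {"She", "He", "The", "Which", "What"}

def extract_entities(text):
    # Stand-in for a lightweight entity extractor: capitalized tokens.
    return {t for t in re.findall(r"\b[A-Z][a-z]+\b", text) if t not in STOP}

def build_graph(passages):
    # One pass over the corpus; no relation extraction and no LLM calls.
    entity_to_sents = defaultdict(set)   # entity -> sentence ids
    sent_to_entities = {}                # sentence id -> entities
    sent_to_passage = {}                 # sentence id -> passage id
    sid = 0
    for pid, passage in enumerate(passages):
        for sentence in re.split(r"(?<=[.!?])\s+", passage.strip()):
            ents = extract_entities(sentence)
            sent_to_entities[sid] = ents
            sent_to_passage[sid] = pid
            for e in ents:
                entity_to_sents[e].add(sid)
            sid += 1
    return {
        "entity_to_sents": dict(entity_to_sents),
        "sent_to_entities": sent_to_entities,
        "sent_to_passage": sent_to_passage,
    }

def retrieve(query, graph, hops=1, decay=0.5):
    e2s = graph["entity_to_sents"]
    s2e = graph["sent_to_entities"]
    s2p = graph["sent_to_passage"]

    # Stage 1: activate query entities, then spread to entities that
    # share a sentence with an active one (illustrative bridging rule).
    scores = {e: 1.0 for e in extract_entities(query) if e in e2s}
    frontier = dict(scores)
    for _ in range(hops):
        next_frontier = {}
        for ent, score in frontier.items():
            for sid in e2s[ent]:
                for other in s2e[sid]:
                    cand = score * decay
                    if cand > scores.get(other, 0.0):
                        scores[other] = cand
                        next_frontier[other] = cand
        frontier = next_frontier

    # Stage 2: aggregate entity scores up to the passage level.
    passage_scores = defaultdict(float)
    for ent, score in scores.items():
        for sid in e2s[ent]:
            passage_scores[s2p[sid]] += score
    return sorted(passage_scores.items(), key=lambda kv: (-kv[1], kv[0]))

passages = [
    "Marie Curie was born in Warsaw. She later moved to Paris.",
    "Warsaw is the capital of Poland.",
    "Paris hosts the Louvre.",
]
graph = build_graph(passages)
ranked = retrieve("Which country was Curie born in?", graph)
print(ranked)
```

In this toy corpus the query mentions only "Curie", yet the passage about Warsaw and Poland is also returned, because "Warsaw" is activated through the sentence it shares with "Curie". Index construction touches each sentence once, which illustrates why this style of graph building can scale linearly with corpus size.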
