Retrieval-augmented generation (RAG) has become a key paradigm for knowledge-intensive question answering. However, existing multi-hop RAG systems remain inefficient: they alternate between retrieval and reasoning at each step, incurring repeated LLM calls, high token consumption, and unstable entity grounding across hops. We propose CompactRAG, a simple yet effective framework that decouples offline corpus restructuring from online reasoning. In the offline stage, an LLM reads the corpus once and converts it into an atomic QA knowledge base that represents knowledge as minimal, fine-grained question-answer pairs. In the online stage, complex queries are decomposed and carefully rewritten to preserve entity consistency, then resolved through dense retrieval followed by RoBERTa-based answer extraction. Notably, the LLM is invoked only twice during inference, once for sub-question decomposition and once for final answer synthesis, regardless of the number of reasoning hops. Experiments on HotpotQA, 2WikiMultiHopQA, and MuSiQue show that CompactRAG achieves competitive accuracy while substantially reducing token consumption relative to iterative RAG baselines, offering a cost-efficient and practical approach to multi-hop reasoning over large knowledge corpora. The implementation is available on GitHub.
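The online stage described above can be sketched as follows. This is a minimal, hypothetical illustration: the two LLM calls are stubbed with fixed outputs, dense retrieval is replaced by keyword overlap over a toy atomic QA knowledge base, and RoBERTa-based extraction is reduced to returning the retrieved pair's answer; all names and data here are illustrative assumptions, not the paper's implementation.

```python
# Sketch of CompactRAG's online stage over a toy atomic QA knowledge base.
# The offline stage would produce such minimal QA pairs from the corpus.
ATOMIC_KB = [
    ("Who directed Inception?", "Christopher Nolan"),
    ("Where was Christopher Nolan born?", "London"),
]

def llm_decompose(question):
    # LLM call #1 (stubbed): decompose the query into sub-questions,
    # using "#i" placeholders for answers of earlier hops.
    return ["Who directed Inception?", "Where was #1 born?"]

def retrieve(sub_q):
    # Stand-in for dense retrieval: pick the KB question with the
    # largest word overlap with the sub-question.
    words = set(sub_q.lower().split())
    return max(ATOMIC_KB, key=lambda qa: len(words & set(qa[0].lower().split())))

def extract(sub_q):
    # Stand-in for RoBERTa-based extraction: return the retrieved answer.
    return retrieve(sub_q)[1]

def llm_synthesize(question, hop_answers):
    # LLM call #2 (stubbed): synthesize the final answer from resolved hops.
    return hop_answers[-1]

def answer(question):
    hop_answers = []
    for sub_q in llm_decompose(question):
        # Rewrite: substitute earlier answers to keep entities consistent
        # across hops; no further LLM calls are made per hop.
        for i, a in enumerate(hop_answers, 1):
            sub_q = sub_q.replace(f"#{i}", a)
        hop_answers.append(extract(sub_q))
    return llm_synthesize(question, hop_answers)

print(answer("Where was the director of Inception born?"))  # → London
```

Note that however many hops the loop runs, only `llm_decompose` and `llm_synthesize` would hit the LLM; each intermediate hop is handled by retrieval and extraction alone, which is the source of the claimed token savings.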