In the traditional RAG framework, the basic retrieval units are typically short. Common retrievers such as DPR work with roughly 100-word Wikipedia paragraphs. This design forces the retriever to search over a massive corpus to find the `needle' unit, while the reader only needs to extract answers from the short retrieved units. Such an imbalanced design, pairing a `heavy' retriever with a `light' reader, can lead to sub-optimal performance. To alleviate this imbalance, we propose a new framework, LongRAG, consisting of a `long retriever' and a `long reader'. LongRAG processes the entirety of Wikipedia into 4K-token units, 30x longer than before. By increasing the unit size, we significantly reduce the total number of units from 22M to 700K. This greatly lowers the burden on the retriever, leading to remarkable retrieval scores: answer recall@1 = 71% on NQ (previously 52%) and answer recall@2 = 72% on HotpotQA (full-wiki) (previously 47%). We then feed the top-k retrieved units ($\approx$ 30K tokens) to an existing long-context LLM to perform zero-shot answer extraction. Without requiring any training, LongRAG achieves an EM of 62.7% on NQ, the best known result. LongRAG also achieves 64.3% on HotpotQA (full-wiki), on par with the SoTA model. Our study offers insights into the future roadmap for combining RAG with long-context LLMs.
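The corpus-preprocessing step described above can be sketched as a greedy packing of short passages into long retrieval units. This is a minimal illustration, not the paper's implementation: the function name is hypothetical, whitespace splitting stands in for a real tokenizer, and the 4K budget is the unit size stated in the abstract.

```python
# Minimal sketch: pack short passages into ~4K-token retrieval units.
# Assumptions (not from the paper's code): whitespace word count as a
# token-count proxy; greedy in-order packing with no overlap.
MAX_UNIT_TOKENS = 4096  # unit budget from the abstract

def num_tokens(text: str) -> int:
    # Crude proxy; a real system would use the LLM's own tokenizer.
    return len(text.split())

def group_into_long_units(passages: list[str],
                          max_tokens: int = MAX_UNIT_TOKENS) -> list[str]:
    """Greedily concatenate short passages into long retrieval units,
    starting a new unit whenever the budget would be exceeded."""
    units: list[str] = []
    current: list[str] = []
    current_len = 0
    for p in passages:
        p_len = num_tokens(p)
        if current and current_len + p_len > max_tokens:
            units.append("\n".join(current))
            current, current_len = [], 0
        current.append(p)
        current_len += p_len
    if current:
        units.append("\n".join(current))
    return units
```

With 100-word passages, each unit here absorbs about 40 passages before hitting the 4K budget, which mirrors the roughly 30x reduction in corpus size (22M passages to 700K units) reported above.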