Effectively incorporating external knowledge into Large Language Models (LLMs) is crucial for enhancing their capabilities and addressing real-world needs. Retrieval-Augmented Generation (RAG) offers an effective way to achieve this by retrieving the most relevant fragments and injecting them into the LLM's input. However, recent advances in LLM context window sizes offer an alternative approach, raising the question of whether RAG remains necessary for effectively handling external knowledge. Several existing studies provide inconclusive comparisons between RAG and long-context (LC) LLMs, largely due to limitations in their benchmark designs. In this paper, we present LaRA, a novel benchmark specifically designed to rigorously compare RAG and LC LLMs. LaRA comprises 2,326 test cases spanning four practical QA task categories and three types of naturally occurring long texts. Through a systematic evaluation of seven open-source and four proprietary LLMs, we find that the optimal choice between RAG and LC depends on a complex interplay of factors, including the model's parameter size, long-text capabilities, context length, task type, and the characteristics of the retrieved chunks. Our findings provide actionable guidelines for practitioners seeking to effectively leverage both RAG and LC approaches when developing and deploying LLM applications. Our code and dataset are available at: \href{https://github.com/Alibaba-NLP/LaRA}{\textbf{https://github.com/Alibaba-NLP/LaRA}}.