Retrieval Augmented Generation (RAG) has emerged as a crucial technique for enhancing the accuracy of Large Language Models (LLMs) by incorporating external information. With the advent of LLMs that support increasingly long context lengths, there is growing interest in understanding how these models perform in RAG scenarios. Can these new long-context models improve RAG performance? This paper presents a comprehensive study of the impact of increased context length on RAG performance across 20 popular open-source and commercial LLMs. We ran RAG workflows on three domain-specific datasets while varying the total context length from 2,000 to 128,000 tokens (and up to 2 million tokens where supported), and report key insights on the benefits and limitations of long context in RAG applications. Our findings reveal that while retrieving more documents can improve performance, only a handful of the most recent state-of-the-art LLMs maintain consistent accuracy at context lengths above 64k tokens. We also identify distinct failure modes in long-context scenarios, suggesting areas for future research.