The Power of Noise: Redefining Retrieval for RAG Systems

Retrieval-Augmented Generation (RAG) systems represent a significant advancement over traditional Large Language Models (LLMs). They enhance generation by incorporating external data retrieved through an Information Retrieval (IR) phase, overcoming the limitations of standard LLMs, which are restricted to their pre-trained knowledge and a limited context window. Research in this area has concentrated predominantly on the generative aspect of LLMs within RAG systems, leaving the retrieval side comparatively underexplored. Our study addresses this gap by critically analyzing the influence of IR components on RAG systems. We examine which characteristics a retriever should possess for effective RAG prompt formulation, focusing on the type of documents that should be retrieved. We evaluate factors such as the relevance of the documents to the prompt, their position, and the number included in the context. Among other insights, our findings reveal that including irrelevant documents can unexpectedly improve accuracy by more than 30%, contradicting our initial assumption that they would degrade quality. These results underscore the need for specialized strategies to integrate retrieval with language generation models, laying the groundwork for future research in this field.
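The factors named in the abstract (document relevance, position, and the number of documents in the context) can be made concrete with a small prompt-assembly sketch. This is an illustrative assumption, not the authors' implementation: the function name `build_prompt`, the `Document [i]:` template, and the `relevant_position` parameter are all hypothetical.

```python
import random
from typing import List, Sequence


def build_prompt(
    question: str,
    relevant_docs: Sequence[str],
    irrelevant_docs: Sequence[str],
    num_irrelevant: int,
    relevant_position: str = "last",
    seed: int = 0,
) -> str:
    """Assemble a RAG prompt mixing relevant and irrelevant documents.

    Hypothetical helper for illustration only; the prompt template is an
    assumption and is not taken from the paper.
    """
    if relevant_position not in {"first", "last"}:
        raise ValueError("relevant_position must be 'first' or 'last'")
    if not 0 <= num_irrelevant <= len(irrelevant_docs):
        raise ValueError("num_irrelevant exceeds the available irrelevant documents")

    # Sample the irrelevant ("noise") documents reproducibly.
    rng = random.Random(seed)
    noise: List[str] = rng.sample(list(irrelevant_docs), num_irrelevant)

    # Place the relevant documents either before or after the noise,
    # i.e. far from or adjacent to the question at the end of the prompt.
    relevant = list(relevant_docs)
    docs = relevant + noise if relevant_position == "first" else noise + relevant

    lines = [f"Document [{i}]: {doc}" for i, doc in enumerate(docs, start=1)]
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)
```

Varying `num_irrelevant` and `relevant_position` while holding the question fixed gives one simple way to probe how context composition affects a generator's answers.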

