The Power of Noise: Redefining Retrieval for RAG Systems

Retrieval-Augmented Generation (RAG) systems represent a significant advancement over traditional Large Language Models (LLMs). RAG systems enhance their generation ability by incorporating external data retrieved through an Information Retrieval (IR) phase, overcoming the limitations of standard LLMs, which are restricted to their pre-trained knowledge and a limited context window. Most research in this area has concentrated on the generative aspect of LLMs within RAG systems, leaving the retrieval side comparatively unexplored. Our study fills this gap by thoroughly and critically analyzing the influence of IR components on RAG systems. This paper analyzes which characteristics a retriever should possess for effective RAG prompt formulation, focusing on the type of documents that should be retrieved. We evaluate several factors: the relevance of the documents to the prompt, their position in the context, and the number of documents included. Our findings reveal, among other insights, that including irrelevant documents can unexpectedly improve accuracy by more than 30%, contradicting our initial assumption that they would degrade quality. These results underscore the need for specialized strategies to integrate retrieval with language generation models, laying the groundwork for future research in this field.
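
To make the three evaluated factors concrete (relevance, position, and number of documents), the following is a minimal Python sketch of how a prompt with one relevant document and a configurable amount of irrelevant "noise" could be assembled. It is an illustration only, not the paper's code: the function `build_prompt`, its parameters, and the `Document [i]` prompt layout are all hypothetical names chosen for this example.

```python
import random


def build_prompt(question, gold_doc, noise_docs, num_noise, gold_index, seed=0):
    """Assemble a prompt with one relevant document and `num_noise` irrelevant ones.

    All names and the prompt layout are assumptions made for illustration.
    `gold_index` is the 0-based slot of the relevant document in the context.
    """
    if not 0 <= gold_index <= num_noise:
        raise ValueError("gold_index must lie in [0, num_noise]")
    rng = random.Random(seed)  # fixed seed so the sampled noise is reproducible
    sampled = rng.sample(noise_docs, num_noise)  # irrelevant documents, drawn at random
    # Insert the relevant document at the requested position among the noise.
    docs = sampled[:gold_index] + [gold_doc] + sampled[gold_index:]
    context = "\n".join(f"Document [{i + 1}]: {d}" for i, d in enumerate(docs))
    return f"{context}\n\nQuestion: {question}\nAnswer:"


prompt = build_prompt(
    question="Who wrote Hamlet?",
    gold_doc="Hamlet is a tragedy by William Shakespeare.",
    noise_docs=["Basalt is a volcanic rock.", "Otters are mammals.", "Jazz began in New Orleans."],
    num_noise=2,
    gold_index=2,  # place the relevant document last, closest to the question
)
print(prompt)
```

Varying `num_noise` and `gold_index` while keeping the question fixed is one way to probe how document count and position affect a model's answers, under the assumptions stated above.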

