Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating up-to-date external knowledge, yet real-world web environments pose two key challenges. First, misinformation is pervasive on the web, and unreliable or misleading content can degrade retrieval accuracy. Second, web tools are underutilized, although effective use of them could sharpen query precision and help mitigate this noise, ultimately improving retrieval results in RAG systems. To address these issues, we propose WebFilter, a novel RAG framework that generates source-restricted queries and filters out unreliable content. WebFilter combines a retrieval filtering mechanism with a behavior- and outcome-driven reward strategy, optimizing both query formulation and retrieval outcomes. Extensive experiments demonstrate that WebFilter improves answer quality and retrieval precision, outperforming existing RAG methods on both in-domain and out-of-domain benchmarks.