Retrieval-augmented generation (RAG) has shown impressive capabilities in mitigating hallucinations in large language models (LLMs). However, LLMs struggle to maintain consistent reasoning when exposed to misleading or conflicting evidence, especially in real-world domains such as politics, where information is often polarized or selectively framed. Mainstream RAG benchmarks evaluate models under clean retrieval settings, where systems generate answers from gold-standard documents, or under synthetically perturbed settings, where documents are artificially injected with noise. These assumptions fail to reflect real-world conditions and often lead to an overestimation of RAG system performance. To address this gap, we introduce RAGuard, a fact-checking benchmark for evaluating the robustness of RAG systems against misleading retrievals. Unlike prior benchmarks that rely on synthetic noise, our dataset captures naturally occurring misinformation by constructing its retrieval corpus from Reddit discussions. It categorizes retrieved evidence into three types: supporting, misleading, and unrelated, providing a realistic and challenging testbed for assessing how well RAG systems navigate different kinds of evidence. Our experiments reveal that, when exposed to potentially misleading retrievals, all tested LLM-powered RAG systems perform worse than their zero-shot baselines (i.e., no retrieval at all), while human annotators consistently perform better, highlighting LLMs' susceptibility to noisy environments. To our knowledge, RAGuard is the first benchmark to systematically assess the robustness of RAG systems against misleading evidence. We expect this benchmark to drive future research toward improving RAG systems beyond idealized datasets, making them more reliable for real-world applications. The dataset is available at https://huggingface.co/datasets/UCSC-IRKM/RAGuard.