ReliabilityRAG: Effective and Provably Robust Defense for RAG-based Web-Search

Retrieval-Augmented Generation (RAG) enhances Large Language Models by grounding their outputs in external documents. These systems, however, remain vulnerable to attacks on the retrieval corpus, such as prompt injection. RAG-based search systems (e.g., Google's Search AI Overview) are a useful setting for studying and defending against such threats: defense algorithms can benefit from built-in reliability signals, such as document ranking, and the adversary faces a non-LLM challenge, because decades of work have gone into thwarting SEO. Motivated by, but not limited to, this scenario, this work introduces ReliabilityRAG, a framework for adversarial robustness that explicitly leverages reliability information about retrieved documents. Our first contribution adopts a graph-theoretic perspective: it identifies a "consistent majority" among the retrieved documents and filters out malicious ones. We introduce a novel algorithm based on finding a Maximum Independent Set (MIS) on a document graph whose edges encode contradiction. Our MIS variant explicitly prioritizes higher-reliability documents and provides provable robustness guarantees against bounded adversarial corruption under natural assumptions. Because exact MIS is computationally costly for large retrieval sets, our second contribution is a scalable weighted sample-and-aggregate framework. It explicitly utilizes reliability information and preserves some robustness guarantees while efficiently handling many documents. We present empirical results showing that ReliabilityRAG provides superior robustness against adversarial attacks compared to prior methods, maintains high benign accuracy, and excels in long-form generation tasks, where prior robustness-focused methods struggled. Our work is a significant step towards more effective, provably robust defenses against retrieved corpus corruption in RAG.
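To make the MIS idea concrete, here is a minimal sketch, not the authors' implementation. It assumes documents are indexed by rank (index 0 is the most reliable) and that the contradiction edges have already been computed by some external check; the function name `reliability_aware_mis` and its parameters are hypothetical. The abstract does not specify how ties between equally large sets are broken, so the lexicographic preference for top-ranked documents used here is an assumption. The brute-force search is exponential in the number of documents and is only practical for small retrieval sets.

```python
from itertools import combinations


def reliability_aware_mis(num_docs, contradictions):
    """Return a maximum independent set of document indices.

    num_docs: number of retrieved documents; index 0 is assumed most reliable.
    contradictions: iterable of (i, j) pairs of documents that contradict
        each other (assumed to be supplied by an external contradiction check).
    """
    edges = {frozenset(pair) for pair in contradictions}

    def is_independent(subset):
        # A subset is independent if no two of its documents contradict.
        return all(frozenset(p) not in edges for p in combinations(subset, 2))

    # Try the largest sizes first. combinations() yields index tuples in
    # lexicographic order, so among equally large independent sets the first
    # one found is the one that favors top-ranked documents.
    for size in range(num_docs, 0, -1):
        for subset in combinations(range(num_docs), size):
            if is_independent(subset):
                return list(subset)
    return []


# Documents 3 and 4 (lowest ranked) contradict each of documents 0, 1, 2.
minority_attack = [(0, 3), (1, 3), (2, 3), (0, 4), (1, 4), (2, 4)]
kept = reliability_aware_mis(5, minority_attack)

# Tie case: {0, 1} and {2, 3} are both independent sets of size 2.
tie_case = [(0, 2), (0, 3), (1, 2), (1, 3)]
kept_on_tie = reliability_aware_mis(4, tie_case)
```

In the first example the consistent majority `[0, 1, 2]` is kept. In the tie case the rank-based preference selects `[0, 1]` over `[2, 3]`, which illustrates why using reliability information matters when the corrupted documents are as numerous as the benign ones.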
