SilentRetrieval: Hijacking Retrieval-Augmented Generation via Semantically-Preserving Adversarial Data Poisoning

Retrieval-Augmented Generation (RAG) mitigates LLM hallucinations but introduces a critical vulnerability: its outputs depend on the integrity of the retrieval corpus. We present SilentRetrieval, a two-stage data poisoning attack that hijacks RAG systems through adversarially crafted yet fluent documents. Stage 1 uses Coordinated Beam Search, a multi-token joint optimization method with a fluency-similarity objective, to keep a poisoned host document retrievable while constraining its perplexity. Stage 2 uses Context-Adaptive Trigger Generation, a lightweight trigger-fusion step driven by a frozen LLM, to integrate manipulation triggers into the document content. Under a one-poisoned-document-per-query evaluation with synthetic target answers, SilentRetrieval achieves 84.6%/81.3% HR@10 and 57.5%/54.8% ASR-LLM on Natural Questions and MS MARCO, respectively, while maintaining near-benign perplexity. Evaluation across four target LLMs shows that the attack retains nontrivial effectiveness with a fixed trigger generator, and transfer tests against unseen retrievers, including ColBERT and commercial embedding models, yield 64.7% average HR@10 under the same injected-corpus protocol. In a sampled Wikipedia-scale evaluation, SilentRetrieval retains 74.2% HR@10 at a 0.016% poisoning ratio. Combining retrieval-side and generation-side defenses reduces attack success substantially, at the cost of added latency. In human evaluation, SilentRetrieval documents are flagged substantially less often than disfluent baselines, although at the current sample size they are still flagged numerically more often than benign content.
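
To make the retrieval metric concrete, the following is a minimal sketch of how HR@10 could be computed under the one-poisoned-document-per-query setting described above. It assumes HR@k means the fraction of target queries whose poisoned document appears among the top-k retrieved results; the exact definition used in the evaluation may differ. The function name hit_rate_at_k, the argument layout, and all query and document IDs are hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence


def hit_rate_at_k(
    rankings: Dict[str, Sequence[str]],
    poisoned: Dict[str, str],
    k: int = 10,
) -> float:
    """Fraction of target queries whose poisoned document is in the top-k.

    rankings: query ID -> retrieved document IDs, best match first
              (hypothetical layout, assumed for this sketch).
    poisoned: query ID -> ID of the single poisoned document for that query.
    """
    if not poisoned:
        raise ValueError("at least one target query is required")
    hits = 0
    for query_id, poisoned_doc_id in poisoned.items():
        # A query with no recorded ranking counts as a miss.
        top_k = list(rankings.get(query_id, []))[:k]
        if poisoned_doc_id in top_k:
            hits += 1
    return hits / len(poisoned)


# Toy data: three target queries, one poisoned document each.
toy_rankings = {
    "q1": ["d3", "p1", "d7"],      # poisoned doc at rank 2: hit
    "q2": ["d1", "d2"],            # poisoned doc not retrieved: miss
    "q3": ["d9"] * 10 + ["p3"],    # poisoned doc at rank 11: miss for k=10
}
toy_poisoned = {"q1": "p1", "q2": "p2", "q3": "p3"}

hr_at_10 = hit_rate_at_k(toy_rankings, toy_poisoned, k=10)
```

On this toy data only one of the three poisoned documents lands in the top 10, so the hit rate is one third. Raising k to 11 would also count the third query as a hit, which shows how sensitive the metric is to the cutoff.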
