Retrieval-Augmented Generation (RAG) empowers Large Language Models (LLMs) to dynamically integrate external knowledge during inference, improving their factual accuracy and adaptability. However, adversaries can inject poisoned external knowledge to override the model's internal memory. While existing attacks iteratively manipulate retrieval content or prompt structure of RAG, they largely ignore the model's internal representation dynamics and neuron-level sensitivities. The underlying mechanism of RAG poisoning has not been fully studied and the effect of knowledge conflict with strong parametric knowledge in RAG is not considered. In this work, we propose NeuroGenPoisoning, a novel attack framework that generates adversarial external knowledge in RAG guided by LLM internal neuron attribution and genetic optimization. Our method first identifies a set of Poison-Responsive Neurons whose activation strongly correlates with contextual poisoning knowledge. We then employ a genetic algorithm to evolve adversarial passages that maximally activate these neurons. Crucially, our framework enables massive-scale generation of effective poisoned RAG knowledge by identifying and reusing promising but initially unsuccessful external knowledge variants via observed attribution signals. At the same time, Poison-Responsive Neurons guided poisoning can effectively resolves knowledge conflict. Experimental results across models and datasets demonstrate consistently achieving high Population Overwrite Success Rate (POSR) of over 90% while preserving fluency. Empirical evidence shows that our method effectively resolves knowledge conflict.


翻译:检索增强生成(RAG)使大型语言模型(LLM)能够在推理过程中动态整合外部知识,从而提升其事实准确性与适应性。然而,攻击者可通过注入污染的外部知识来覆盖模型的内部记忆。现有攻击方法虽能迭代操纵RAG的检索内容或提示结构,却大多忽视了模型内部表征的动态特性与神经元层面的敏感性。RAG污染的内在机制尚未得到充分研究,且未考虑其与RAG中强参数化知识产生冲突的影响。本文提出NeuroGenPoisoning,一种基于LLM内部神经元归因与遗传优化的新型攻击框架,用于生成RAG中的对抗性外部知识。该方法首先识别一组"污染响应神经元",其激活强度与上下文污染知识高度相关;随后采用遗传算法进化对抗性文本段落,以最大化激活这些神经元。关键的是,本框架通过观测到的归因信号识别并复用具有潜力但初始未成功的知识变体,实现了大规模有效污染RAG知识的生成。同时,基于污染响应神经元的引导能有效化解知识冲突。跨模型与数据集的实验结果表明,该方法在保持文本流畅度的同时,持续实现超过90%的高群体覆盖成功率。实证证据表明,本方法能有效解决知识冲突问题。

0
下载
关闭预览

相关内容

Top
微信扫码咨询专知VIP会员