Needle-in-RAG: Prompt-Conditioned Character-Level Traceback of Poisoned Spans in Retrieved Evidence

Retrieval-augmented generation (RAG) improves factual grounding by conditioning large language models on retrieved evidence, but it also opens a data-layer attack surface: poisoned corpus entries can steer outputs without changing model parameters. Existing defenses and traceback methods are largely passage-level, which is too coarse for modern attacks whose effective payload may be a short fabricated claim, trigger phrase, or hidden instruction embedded inside an otherwise benign chunk. We study black-box character-level poison traceback in RAG and present RAGCharacter, a two-pass forensic framework that localizes the responsible retrieved span for a concrete misgeneration event. Pass-0 runs standard RAG while logging a prompt-anchored execution trace. Pass-1 re-enters a triggered trace and performs event-conditioned traceback over prompt-used evidence via budgeted counterfactual masking and replay, yielding an attribution span for forensic reporting and a causal span under the logged trace. We further introduce an evaluation protocol that measures both event-level chunk traceback and character-level localization fidelity. Across two QA corpora, five poisoning attack families, six target LLMs, and multiple passage- and character-level baselines, RAGCharacter achieves the best overall trade-off within our benchmark between localization accuracy and low over-attribution. These results suggest that prompt-conditioned, black-box character-level traceback can be feasible, moving RAG forensics from document-level suspicion toward finer-grained evidence auditing and potential remediation.

翻译：检索增强生成（RAG）通过将大语言模型与检索证据进行条件化处理来提升事实依据，但同时也引入了数据层攻击面：被投毒的语料库条目无需修改模型参数即可改变输出。现有防御与溯源方法主要基于段落级别，这对于现代攻击而言粒度过于粗糙——其有效载荷可能是短小的伪造声明、触发短语或嵌入良性文本块中的隐蔽指令。我们研究了RAG中黑盒字符级中毒溯源问题，并提出了RAGCharacter——一种双阶段取证框架，可定位具体错误生成事件中引发问题的检索跨度。阶段0运行标准RAG并记录提示锚定的执行轨迹；阶段1重新进入被触发轨迹，通过预算限制的反事实遮蔽与重放对提示使用证据执行事件条件化溯源，生成用于取证报告的归因跨度及基于记录轨迹的因果跨度。我们进一步提出评估协议，可同时衡量事件级段落溯源与字符级定位保真度。在两组QA语料库、五类投毒攻击族、六个目标大语言模型及多个段落/字符级基线方法的对比中，RAGCharacter在定位准确度与低过度归因之间实现了本基准测试范围内的最佳整体权衡。这些结果表明，基于提示条件化的黑盒字符级溯源具备可行性，推动RAG取证从文档级怀疑迈向更细粒度的证据审计与潜在修复。