Recent work showed that retrieval based on embedding similarity (e.g., for retrieval-augmented generation) is vulnerable to poisoning: an adversary can craft malicious documents that are retrieved in response to broad classes of queries. We demonstrate that prior HotFlip-based techniques produce documents that are very easy to detect using perplexity filtering. Even if generation is constrained to produce low-perplexity text, the resulting documents are recognized as unnatural by LLMs and can be automatically filtered from the retrieval corpus. We design, implement, and evaluate a new controlled generation technique that combines an adversarial objective (embedding similarity) with a "naturalness" objective based on soft scores computed using an open-source, surrogate LLM. The resulting adversarial documents (1) cannot be automatically detected using perplexity filtering or other LLMs, except at the cost of significant false positives in the retrieval corpus, yet (2) achieve similar poisoning efficacy to easily-detectable documents generated using HotFlip, and (3) are significantly more effective than prior methods for energy-guided generation, such as COLD.
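The combined objective described above can be sketched abstractly: the attacker scores a candidate document by a weighted sum of (a) its embedding similarity to a target query and (b) a naturalness score from a surrogate LM. The helper names, toy embeddings, and the scalar weighting below are illustrative placeholders; the paper's actual optimization operates over soft token distributions, not precomputed scalar scores.

```python
import math

def cosine_similarity(u, v):
    # Embedding-similarity term of the adversarial (retrieval) objective.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def combined_score(doc_emb, query_emb, lm_log_prob, lam=0.5):
    # Weighted combination of the adversarial objective and the
    # "naturalness" objective (here a surrogate-LM log-probability).
    # lam trades off retrievability against naturalness; both the weight
    # and the scalar mixing are assumptions for this sketch.
    return lam * cosine_similarity(doc_emb, query_emb) + (1 - lam) * lm_log_prob

# Toy example: two candidates with equal naturalness but different
# alignment with the target query embedding.
q = [1.0, 0.0]
score_a = combined_score([1.0, 0.0], q, lm_log_prob=-2.0)  # aligned with query
score_b = combined_score([0.0, 1.0], q, lm_log_prob=-2.0)  # orthogonal to query
```

Under this scoring, a document that is both natural (high LM log-probability) and close to the query embedding dominates candidates that sacrifice one property for the other, which is the trade-off the paper's controlled generation navigates.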