We design, implement, and evaluate adversarial decoding, a new, generic text generation technique that produces readable documents for different adversarial objectives. Prior methods either produce easily detectable gibberish or cannot handle objectives that include embedding similarity. In particular, they only work for direct attacks (such as jailbreaking) and cannot produce adversarial text for realistic indirect injection, e.g., documents that (1) are retrieved by retrieval-augmented generation (RAG) systems in response to broad classes of queries and (2) adversarially influence subsequent generation. We also show that fluency (low perplexity) is not sufficient to evade filtering. We measure the effectiveness of adversarial decoding for different objectives, including RAG poisoning, jailbreaking, and evasion of defensive filters, and demonstrate that it outperforms existing methods while producing readable adversarial documents.