In a prompt injection attack, an attacker injects a prompt into the original one, aiming to make the LLM follow the injected prompt and perform a task chosen by the attacker. Existing prompt injection attacks primarily focus on how to blend the injected prompt into the original prompt without altering the LLM itself. Our experiments show that these attacks achieve some success, but there is still significant room for improvement. In this work, we show that an attacker can boost the success of prompt injection attacks by poisoning the LLM's alignment process. Specifically, we propose PoisonedAlign, a method to strategically create poisoned alignment samples. When even a small fraction of the alignment data is poisoned using our method, the aligned LLM becomes more vulnerable to prompt injection while maintaining its foundational capabilities. The code is available at https://github.com/Sadcardation/PoisonedAlign
翻译:在提示注入攻击中,攻击者将恶意提示注入原始提示,旨在使大语言模型遵循注入的提示并执行攻击者选择的任务。现有的提示注入攻击主要关注如何在不改变大语言模型本身的情况下将注入提示融入原始提示。我们的实验表明,这些攻击虽取得一定成功,但仍有显著改进空间。本研究表明,攻击者可通过毒化大语言模型的对齐过程来提升提示注入攻击的成功率。具体而言,我们提出PoisonedAlign方法,用于策略性地创建毒化对齐样本。当使用我们的方法毒化即使一小部分对齐数据时,对齐后的大语言模型在保持其基础能力的同时,会变得更容易受到提示注入攻击。代码发布于https://github.com/Sadcardation/PoisonedAlign