Despite advances in AI alignment, language models (LM) remain vulnerable to adversarial attacks or jailbreaking, in which adversaries modify input prompts to induce harmful behavior. While some defenses have been proposed, they focus on narrow threat models and fall short of a strong defense, which we posit should be effective, universal, and practical. To achieve this, we propose the first adversarial objective for defending LMs against jailbreaking attacks and an algorithm, robust prompt optimization (RPO), that uses gradient-based token optimization to enforce harmless outputs. This results in an easily accessible suffix that significantly improves robustness to both jailbreaks seen during optimization and unknown, held-out jailbreaks, reducing the attack success rate on Starling-7B from 84% to 8.66% across 20 jailbreaks. In addition, we find that RPO has a minor effect on normal LM use, is successful under adaptive attacks, and can transfer to black-box models, reducing the success rate of the strongest attack on GPT-4 from 92% to 6%.
翻译:尽管AI对齐技术取得了进展,语言模型仍易受到对抗攻击或越狱攻击,即攻击者通过修改输入提示来诱导模型生成有害行为。虽然已有若干防御方法被提出,但这些方法仅适用于狭窄的威胁模型,未能实现我们认定的有效、通用且实用的强防御。为此,我们提出了首个针对语言模型越狱攻击的防御对抗目标,并开发了一种基于梯度优化的鲁棒提示优化算法,通过强制生成无害输出来实现防御。该方法生成一个易于附加的后缀,能显著提升对优化过程中所见越狱攻击及未知保留越狱攻击的鲁棒性,在20种越狱攻击下将Starling-7B的攻击成功率从84%降至8.66%。此外,我们发现RPO对语言模型正常使用影响甚微,在自适应攻击下仍保持有效性,并能迁移至黑盒模型,将GPT-4上最强攻击的成功率从92%降至6%。