Despite advances in AI alignment, large language models (LLMs) remain vulnerable to adversarial attacks, or jailbreaks, in which adversaries modify prompts to induce unwanted behavior. While some defenses have been proposed, they have not been adapted to newly proposed attacks and more challenging threat models. To address this, we propose an optimization-based objective for defending LLMs against jailbreaking attacks and an algorithm, Robust Prompt Optimization (RPO), for creating robust system-level defenses. Our approach directly incorporates the adversary into the defensive objective and optimizes a lightweight, transferable suffix, enabling RPO to adapt to worst-case adaptive attacks. Our theoretical and experimental results show improved robustness both to jailbreaks seen during optimization and to unknown jailbreaks, reducing the attack success rate (ASR) on GPT-4 to 6% and on Llama-2 to 0% on JailbreakBench, setting the state of the art. Code is available at https://github.com/lapisrocks/rpo.
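At a high level, incorporating the adversary into the defensive objective can be sketched as a minimax problem. The notation below is an illustrative simplification, not the paper's exact formulation: $x$ denotes the user prompt, $a$ an adversarial jailbreak string, $s$ the defensive suffix being optimized, $y^{*}$ a safe target response, and $\oplus$ string concatenation.

```latex
% Illustrative sketch (assumed notation): the defender minimizes, over
% suffixes s, the worst-case loss of producing the safe response y*
% when the adversary chooses the jailbreak a.
\min_{s} \; \max_{a} \; \mathcal{L}\bigl(y^{*} \mid x \oplus a \oplus s\bigr)
```

Here $\mathcal{L}$ would be the model's negative log-likelihood of the safe target; the inner maximization models the worst-case adaptive attack, which is why a suffix optimized under this objective can transfer to unseen jailbreaks.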