Many jailbreak attacks on large language models (LLMs) rely on a common objective: making the model respond with the prefix "Sure, here is (harmful request)". While straightforward, this objective has two limitations: limited control over model behavior, often resulting in incomplete or unrealistic responses, and a rigid format that hinders optimization. To address these limitations, we introduce AdvPrefix, a new prefix-forcing objective that enables more nuanced control over model behavior while remaining easy to optimize. Our objective leverages model-dependent prefixes, selected automatically based on two criteria: a high prefilling attack success rate and a low negative log-likelihood. It can further simplify optimization by using multiple prefixes for a single user request. AdvPrefix integrates seamlessly into existing jailbreak attacks to improve their performance for free. For example, simply replacing the GCG attack's target prefixes with ours on Llama-3 improves nuanced attack success rates from 14% to 80%, suggesting that current alignment struggles to generalize to unseen prefixes. Our work demonstrates the importance of jailbreak objectives in achieving nuanced jailbreaks.
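The two selection criteria named above can be made concrete. The sketch below illustrates one way to score candidate prefixes, assuming a Hugging Face transformers workflow; the model name, thresholds, sample count, and the `judge` callable are illustrative assumptions, not the paper's released implementation. Each candidate prefix is scored by its mean negative log-likelihood (NLL) under the target model and by an estimated prefilling attack success rate (ASR), and only prefixes that do well on both are kept.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed target model
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

def chat_prompt(request: str) -> str:
    """Chat-format a user request up to the start of the assistant turn."""
    return tok.apply_chat_template(
        [{"role": "user", "content": request}],
        tokenize=False,
        add_generation_prompt=True,
    )

@torch.no_grad()
def prefix_nll(request: str, prefix: str) -> float:
    """Criterion 1: mean negative log-likelihood of the prefix tokens
    conditioned on the chat-formatted request (lower = easier to force)."""
    prompt_ids = tok(chat_prompt(request), return_tensors="pt").input_ids
    prefix_ids = tok(prefix, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, prefix_ids], dim=1)
    logits = model(input_ids).logits
    # Logits at position t predict token t+1, so shift by one.
    preds = logits[0, prompt_ids.shape[1] - 1 : -1]
    return torch.nn.functional.cross_entropy(preds, prefix_ids[0]).item()

@torch.no_grad()
def prefilling_asr(request: str, prefix: str, judge, n_samples: int = 8) -> float:
    """Criterion 2: prefill the prefix as the start of the assistant turn,
    sample continuations, and report the fraction the judge marks successful.
    `judge` is a hypothetical callable (request, response) -> bool."""
    ids = tok(chat_prompt(request) + prefix, return_tensors="pt").input_ids
    hits = 0
    for _ in range(n_samples):
        out = model.generate(ids, max_new_tokens=256, do_sample=True,
                             pad_token_id=tok.eos_token_id)
        response = prefix + tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
        hits += int(judge(request, response))
    return hits / n_samples

def select_prefixes(request, candidates, judge, nll_max=2.0, asr_min=0.5, k=2):
    """Keep up to k candidate prefixes with low NLL and high prefilling
    ASR (the thresholds and k here are assumptions, not the paper's values)."""
    kept = []
    for p in candidates:
        nll, asr = prefix_nll(request, p), prefilling_asr(request, p, judge)
        if nll <= nll_max and asr >= asr_min:
            kept.append((asr, -nll, p))
    return [p for _, _, p in sorted(kept, reverse=True)[:k]]
```

The two criteria are complementary: a low-NLL prefix is easy for an optimizer such as GCG to force, while a high prefilling ASR indicates that, once the prefix is forced, the model tends to continue into a complete response rather than refusing or trailing off.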