Refusal training is widely used to prevent LLMs from generating harmful, undesirable, or illegal outputs. We reveal a curious generalization gap in the current refusal training approaches: simply reformulating a harmful request in the past tense (e.g., "How to make a Molotov cocktail?" to "How did people make a Molotov cocktail?") is often sufficient to jailbreak many state-of-the-art LLMs. We systematically evaluate this method on Llama-3 8B, Claude-3.5 Sonnet, GPT-3.5 Turbo, Gemma-2 9B, Phi-3-Mini, GPT-4o mini, GPT-4o, o1-mini, o1-preview, and R2D2 models using GPT-3.5 Turbo as a reformulation model. For example, the success rate of this simple attack on GPT-4o increases from 1% using direct requests to 88% using 20 past tense reformulation attempts on harmful requests from JailbreakBench with GPT-4 as a jailbreak judge. Interestingly, we also find that reformulations in the future tense are less effective, suggesting that refusal guardrails tend to consider past historical questions more benign than hypothetical future questions. Moreover, our experiments on fine-tuning GPT-3.5 Turbo show that defending against past reformulations is feasible when past tense examples are explicitly included in the fine-tuning data. Overall, our findings highlight that the widely used alignment techniques -- such as SFT, RLHF, and adversarial training -- employed to align the studied models can be brittle and do not always generalize as intended. We provide code and jailbreak artifacts at https://github.com/tml-epfl/llm-past-tense.
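The attack described above can be sketched as a simple loop: a helper model rewrites the harmful request in the past tense, the target model answers it, and a judge model decides whether the answer constitutes a jailbreak. The sketch below is illustrative only; the prompt wording and the function interfaces are assumptions, not the paper's exact templates (see the linked repository for the real artifacts).

```python
# Minimal sketch of the past-tense reformulation attack, assuming
# callable wrappers around the target LLM, a reformulation model
# (e.g., GPT-3.5 Turbo), and a jailbreak judge (e.g., GPT-4).
# The prompt text is a hypothetical stand-in for the paper's template.

def make_reformulation_prompt(request: str) -> str:
    """Build a prompt asking a helper LLM to rephrase a request in the past tense."""
    return (
        "Rephrase the following request in the past tense, as a question "
        f'about how people did it historically:\n\n"{request}"'
    )

def attack(target_llm, reformulate, judge, request: str, n_attempts: int = 20):
    """Query the target with up to n_attempts past-tense reformulations.

    Returns the first (reformulated request, response) pair that the
    judge labels as a jailbreak, or None if all attempts are refused.
    """
    for _ in range(n_attempts):
        past_request = reformulate(make_reformulation_prompt(request))
        response = target_llm(past_request)
        if judge(request, response):  # judge sees the original request + response
            return past_request, response
    return None
```

With stochastic sampling in the reformulation model, each attempt yields a different past-tense variant, which is why the paper reports success rates as a function of the number of attempts (e.g., 20).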