Large Language Models (LLMs) are known to be susceptible to crafted adversarial attacks or jailbreaks that lead to the generation of objectionable content, despite being aligned to human preferences using safety fine-tuning methods. While the large dimensionality of the input token space makes it inevitable that adversarial prompts capable of jailbreaking these models exist, we aim to evaluate whether safety fine-tuned LLMs are safe against natural prompts that are semantically related to toxic seed prompts which elicit safe responses after alignment. Surprisingly, we find that popular aligned LLMs such as GPT-4 can be compromised using naive prompts that are not even crafted with the objective of jailbreaking the model. Furthermore, we empirically show that, given a seed prompt that elicits a toxic response from an unaligned model, one can systematically generate several semantically related natural prompts that jailbreak aligned LLMs. To this end, we propose Response Guided Question Augmentation (ReG-QA), a method for evaluating the generalization of safety-aligned LLMs to natural prompts, which first generates several toxic answers from a seed question using an unaligned LLM (Q to A), and then leverages an LLM to generate questions that are likely to produce these answers (A to Q). Interestingly, we find that safety fine-tuned LLMs such as GPT-4o readily produce natural jailbreak questions from unsafe content without refusal, and can thus be used for the latter (A to Q) step. We obtain attack success rates comparable to or better than those of leading adversarial attack methods on the JailbreakBench leaderboard, while being significantly more stable against defenses such as Smooth-LLM and Synonym Substitution, which are effective against all existing attacks on the leaderboard.
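The two-step ReG-QA pipeline (Q to A, then A to Q) can be sketched as follows. This is a minimal control-flow illustration only: `unaligned_generate` and `question_generate` are hypothetical placeholders for calls to an unaligned model and a question-generating LLM, not APIs from the paper.

```python
def unaligned_generate(seed_question, n):
    """Q -> A: placeholder for sampling n (potentially toxic) answers
    to a seed question from an unaligned model."""
    return [f"answer_{i} to: {seed_question}" for i in range(n)]

def question_generate(answer, k):
    """A -> Q: placeholder for prompting an LLM to propose k natural
    questions likely to elicit the given answer (the abstract notes that
    even safety fine-tuned models such as GPT-4o perform this step
    without refusal)."""
    return [f"question_{j} targeting: {answer}" for j in range(k)]

def reg_qa(seed_question, n_answers=5, k_questions=3):
    """Response Guided Question Augmentation: expand one toxic seed
    question into a pool of semantically related natural prompts, which
    are then evaluated against the target aligned LLM."""
    candidates = []
    for answer in unaligned_generate(seed_question, n_answers):
        candidates.extend(question_generate(answer, k_questions))
    return candidates

# Example: 2 answers x 2 questions each -> 4 candidate natural prompts.
prompts = reg_qa("seed question", n_answers=2, k_questions=2)
print(len(prompts))  # -> 4
```

The fan-out structure (n answers times k questions per answer) is why a single seed prompt yields many candidate jailbreaks; the specific sample counts here are illustrative, not values reported in the abstract.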