Although significant effort has been devoted to aligning large language models (LLMs), red-teaming reports suggest that these carefully aligned LLMs can still be jailbroken via adversarial prompting, tuning, or decoding. Examining the jailbreaking vulnerability of aligned LLMs, we observe that the decoding distributions of jailbroken and aligned models differ only in the initial generations. This observation motivates the weak-to-strong jailbreaking attack, in which an adversary uses smaller unsafe/aligned LLMs (e.g., 7B) to guide jailbreaking of a significantly larger aligned LLM (e.g., 70B). Jailbreaking requires only one additional decoding pass through each of two smaller LLMs, which adds minimal computation and latency compared to decoding the larger LLM itself. We demonstrate the efficacy of this attack in experiments on five models from three different organizations. Our study reveals a previously unnoticed yet efficient jailbreaking method, exposing an urgent safety issue that must be considered when aligning LLMs. As an initial step, we propose a defense strategy against such attacks, though building stronger defenses remains challenging. Code for replicating the method is available at https://github.com/XuandongZhao/weak-to-strong
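The steering mechanism the abstract alludes to can be sketched as follows: the next-token distribution of the large aligned model is reweighted by the probability ratio between a small unsafe model and a small aligned model. This is a minimal illustrative sketch, not the paper's implementation; the function name, the amplification factor `alpha`, and the use of raw logit vectors are assumptions for illustration.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D logit vector."""
    x = np.asarray(logits, dtype=float)
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def weak_to_strong_probs(strong_logits, weak_unsafe_logits, weak_safe_logits, alpha=1.0):
    """Illustrative weak-to-strong steering (assumed form, not the paper's code):
    amplify the strong model's next-token distribution by the log-probability
    ratio of a small unsafe model over a small aligned model, scaled by alpha."""
    combined = (
        log_softmax(strong_logits)
        + alpha * (log_softmax(weak_unsafe_logits) - log_softmax(weak_safe_logits))
    )
    # Renormalize back to a probability distribution.
    p = np.exp(combined - combined.max())
    return p / p.sum()
```

With `alpha=0` the function reduces to the strong model's own softmax; increasing `alpha` pushes the distribution toward tokens the small unsafe model prefers over the small aligned model, only at the cost of two small-model forward passes per step.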