Large language models (LLMs) are vulnerable to jailbreak attacks, which cause them to generate harmful, unethical, or biased text. However, existing jailbreaking methods are computationally costly. In this paper, we propose the weak-to-strong jailbreaking attack, an efficient method for attacking aligned LLMs so that they produce harmful text. Our key observation is that jailbroken and aligned models differ only in their initial decoding distributions. The key technical insight of the weak-to-strong attack is to use two smaller models (one safe, one unsafe) to adversarially modify the decoding probabilities of a significantly larger safe model. We evaluate the weak-to-strong attack on five diverse LLMs from three organizations. The results show that our method can increase the misalignment rate to over 99% on two datasets with just one forward pass per example. Our study exposes an urgent safety issue that needs to be addressed when aligning LLMs. As an initial attempt, we propose a defense strategy to protect against such attacks, but creating more advanced defenses remains challenging. The code for replicating the method is available at https://github.com/XuandongZhao/weak-to-strong