Many existing jailbreak techniques rely on solving a discrete combinatorial optimization problem, while more recent approaches train LLMs to generate multiple adversarial prompts. Both approaches, however, require significant computational resources to produce even a single adversarial prompt. We hypothesize that this inefficiency stems from an inadequate characterization of the jailbreak problem. To address this gap, we formulate the jailbreak problem in terms of alignment: starting from an available safety-aligned model, we leverage an unsafe reward to guide the safe model toward generating unsafe outputs using alignment techniques (e.g., reinforcement learning from human feedback), effectively performing jailbreaking via alignment. We propose a novel jailbreak method called LIAR (LeveragIng Alignment to jailbReak). To demonstrate the simplicity and effectiveness of our approach, we employ a best-of-N method to solve the alignment problem. LIAR offers significant advantages: lower computational requirements with no additional training, fully black-box operation, competitive attack success rates, and more human-readable prompts. We provide theoretical insights into the possibility of jailbreaking a safety-aligned model, revealing inherent vulnerabilities in current alignment strategies for LLMs, and we establish sub-optimality guarantees for the proposed LIAR. Experimentally, we achieve an attack success rate (ASR) comparable to the state of the art, with a 10x improvement in perplexity and a time-to-attack measured in seconds rather than tens of hours.
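To make the best-of-N formulation concrete, the following is a minimal sketch of the sampling loop it implies. The `attacker`, `target`, and `unsafe_reward` interfaces are hypothetical stand-ins introduced only for illustration; the paper's actual implementation may differ.

```python
# Hypothetical sketch of best-of-N jailbreaking via an unsafe reward.
# `attacker`, `target`, and `unsafe_reward` are assumed interfaces,
# not the paper's actual components.

def best_of_n_jailbreak(attacker, target, unsafe_reward, query, n=32):
    """Sample n adversarial prompts and keep the highest-reward one."""
    best_prompt, best_score = None, float("-inf")
    for _ in range(n):
        # Black-box step 1: the attacker model proposes an adversarial
        # suffix for the harmful query (no gradient access needed).
        suffix = attacker.generate(query)
        prompt = f"{query} {suffix}"
        # Black-box step 2: query the safety-aligned target model.
        response = target.generate(prompt)
        # Score the response with an unsafe reward model; a higher score
        # means the response deviates further from safe behavior.
        score = unsafe_reward(query, response)
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score
```

Because each iteration only requires forward generation and a reward evaluation, this loop needs no training and runs in seconds, which is consistent with the efficiency claims above.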