Many existing jailbreak techniques rely on solving a discrete combinatorial optimization problem, while more recent approaches train LLMs to generate multiple adversarial prompts. Both approaches, however, require significant computational resources to produce even a single adversarial prompt. We hypothesize that this inefficiency stems from an inadequate characterization of the jailbreak problem. To address this gap, we formulate the jailbreak problem in terms of alignment: starting from an available safety-aligned model, we leverage an unsafe reward to guide the safe model toward generating unsafe outputs using alignment techniques (e.g., reinforcement learning from human feedback), effectively performing jailbreaking via alignment. We propose a novel jailbreak method called LIAR (LeveragIng Alignment to jailbReak). To demonstrate the simplicity and effectiveness of our approach, we employ a best-of-N method to solve the alignment problem. LIAR offers significant advantages: lower computational requirements with no additional training, fully black-box operation, competitive attack success rates, and more human-readable prompts. We provide theoretical insights into the possibility of jailbreaking a safety-aligned model, revealing inherent vulnerabilities in current alignment strategies for LLMs, and we establish sub-optimality guarantees for the proposed LIAR. Experimentally, we achieve an attack success rate (ASR) comparable to the state of the art, with a 10x improvement in perplexity and a time-to-attack measured in seconds rather than tens of hours.
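To make the best-of-N formulation concrete, the following is a minimal sketch of the sampling loop it implies. The `attacker`, `target`, and `unsafe_reward` interfaces are hypothetical stand-ins introduced only for illustration; the paper's actual implementation may differ.

```python
# Hypothetical sketch of best-of-N jailbreaking via an unsafe reward.
# `attacker`, `target`, and `unsafe_reward` are assumed interfaces,
# not the paper's actual components.

def best_of_n_jailbreak(attacker, target, unsafe_reward, query, n=32):
    """Sample n adversarial prompts and keep the highest-reward one."""
    best_prompt, best_score = None, float("-inf")
    for _ in range(n):
        # Black-box step 1: the attacker model proposes an adversarial
        # suffix for the harmful query (no gradient access needed).
        suffix = attacker.generate(query)
        prompt = f"{query} {suffix}"
        # Black-box step 2: query the safety-aligned target model.
        response = target.generate(prompt)
        # Score the response with an unsafe reward model; a higher score
        # means the response deviates further from safe behavior.
        score = unsafe_reward(query, response)
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score
```

Because each iteration only requires forward generation and a reward evaluation, this loop needs no training and runs in seconds, which is consistent with the efficiency claims above.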