Jailbreak attacks on large language models (LLMs) have recently received increasing attention. A comprehensive assessment of LLM safety must consider jailbreaks with diverse attributes, such as contextual coherence and sentiment or stylistic variations. It is therefore beneficial to study controllable jailbreaking, i.e., how to enforce control over LLM attacks. In this paper, we formalize the controllable attack generation problem and build a novel connection between this problem and controllable text generation, a well-explored topic in natural language processing. Based on this connection, we adapt Energy-based Constrained Decoding with Langevin Dynamics (COLD), a state-of-the-art, highly efficient algorithm for controllable text generation, and introduce the COLD-Attack framework, which unifies and automates the search for adversarial LLM attacks under a variety of control requirements such as fluency, stealthiness, sentiment, and left-right coherence. The controllability enabled by COLD-Attack leads to diverse new jailbreak scenarios that not only cover the standard setting of generating fluent suffix attacks, but also address new controllable attack settings such as adversarially revising a user query with minimal paraphrasing and inserting stealthy attacks in context with left-right coherence. Our extensive experiments on various LLMs (Llama-2, Mistral, Vicuna, Guanaco, GPT-3.5) show COLD-Attack's broad applicability, strong controllability, high success rate, and attack transferability. Our code is available at https://github.com/Yu-Fangxu/COLD-Attack.