Despite efforts to align large language models to produce harmless responses, they are still vulnerable to jailbreak prompts that elicit unrestricted behaviour. In this work, we investigate persona modulation as a black-box jailbreaking method to steer a target model to take on personalities that are willing to comply with harmful instructions. Rather than manually crafting prompts for each persona, we automate the generation of jailbreaks using a language model assistant. We demonstrate a range of harmful completions made possible by persona modulation, including detailed instructions for synthesising methamphetamine, building a bomb, and laundering money. These automated attacks achieve a harmful completion rate of 42.5% in GPT-4, which is 185 times larger than before modulation (0.23%). These prompts also transfer to Claude 2 and Vicuna with harmful completion rates of 61.0% and 35.9%, respectively. Our work reveals yet another vulnerability in commercial large language models and highlights the need for more comprehensive safeguards.