Despite efforts to align large language models to produce harmless responses, they are still vulnerable to jailbreak prompts that elicit unrestricted behaviour. In this work, we investigate persona modulation as a black-box jailbreaking method to steer a target model to take on personalities that are willing to comply with harmful instructions. Rather than manually crafting prompts for each persona, we automate the generation of jailbreaks using a language model assistant. We demonstrate a range of harmful completions made possible by persona modulation, including detailed instructions for synthesising methamphetamine, building a bomb, and laundering money. These automated attacks achieve a harmful completion rate of 42.5% in GPT-4, which is 185 times larger than before modulation (0.23%). These prompts also transfer to Claude 2 and Vicuna with harmful completion rates of 61.0% and 35.9%, respectively. Our work reveals yet another vulnerability in commercial large language models and highlights the need for more comprehensive safeguards.