Jailbreak attacks aim to exploit large language models (LLMs) by inducing them to generate harmful content, thereby revealing their vulnerabilities. Understanding and addressing these attacks is crucial for advancing the field of LLM safety. Previous jailbreak approaches have mainly focused on direct manipulation of harmful intent, with limited attention to the impact of persona prompts. In this study, we systematically explore the efficacy of persona prompts in compromising LLM defenses. We propose a genetic algorithm-based method that automatically crafts persona prompts to bypass LLMs' safety mechanisms. Our experiments reveal that (1) our evolved persona prompts reduce refusal rates by 50-70% across multiple LLMs, and (2) these prompts demonstrate synergistic effects when combined with existing attack methods, increasing success rates by 10-20%. Our code and data are available at https://github.com/CjangCjengh/Generic_Persona.