As LLMs continue to shape real-world applications, automated jailbreak generation has become essential for revealing safety weaknesses and guiding model improvement. Existing automatic jailbreak generation methods do not fully address two aspects: adaptability to evolving safety-finetuned models, without which attacks lose effectiveness on newer model versions, and diversity of the generated prompts, without which attack patterns become narrow or repetitive. To address these issues, we propose EvoJail, an instruction-fusion-driven evolutionary jailbreak generation framework. EvoJail formalizes jailbreak prompt generation as a multi-objective black-box optimization problem and applies the principles of evolutionary algorithms to search for jailbreak prompts that adapt across model versions and exhibit diverse attack patterns. Specifically, EvoJail embeds jailbreak prompt generation in an iterative evolutionary loop: at each iteration, candidate prompts are evaluated directly against the target model, then selected and varied based on the target model's responses, so that the generation process continuously adapts to model updates. To enhance diversity, EvoJail uses field-aware instruction fusion to construct diverse starting points, incorporates diversity-aware objectives into the evolutionary fitness function to guide the search toward prompts with richer semantic variation, and employs multi-level LLM-based mutation operators that modify prompt structures at different granularities to promote structural diversity throughout the evolutionary process. Results demonstrate that EvoJail has stronger adaptability, achieving an attack success rate above $93\%$ and improving diversity metrics by more than $5.6\%$ over state-of-the-art methods.