While large language models (LLMs) exhibit remarkable capabilities across a wide range of tasks, they remain exposed to security risks such as jailbreak attacks, which exploit vulnerabilities to bypass safety measures and elicit harmful outputs. Existing jailbreak strategies focus mainly on maximizing attack success rate (ASR) and frequently neglect other critical factors, including the relevance of the jailbreak response to the query and the stealthiness of the prompt. This narrow focus on a single objective can produce ineffective attacks that either lack contextual relevance or are easily recognized. In this work, we introduce BlackDAN, a black-box attack framework based on multi-objective optimization that aims to generate high-quality prompts which jailbreak effectively while maintaining contextual relevance and minimizing detectability. BlackDAN leverages multi-objective evolutionary algorithms (MOEAs), specifically NSGA-II, to optimize jailbreaks across multiple objectives, including ASR, stealthiness, and semantic relevance. By integrating mechanisms such as mutation, crossover, and Pareto dominance, BlackDAN provides a transparent and interpretable process for generating jailbreaks. Furthermore, the framework allows customization based on user preferences, enabling the selection of prompts that balance harmfulness, relevance, and other factors. Experimental results demonstrate that BlackDAN outperforms traditional single-objective methods, yielding higher success rates and improved robustness across various LLMs and multimodal LLMs, while ensuring that jailbreak responses are both relevant and less detectable.
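As a rough illustration of the Pareto-dominance ranking that NSGA-II relies on, the sketch below sorts generic objective vectors into successive non-dominated fronts. It is not BlackDAN's implementation: the function names `dominates` and `non_dominated_sort`, the assumption that every objective is maximized, and the two-objective score values are all hypothetical, and the simple repeated-filtering loop stands in for the faster bookkeeping used by NSGA-II itself.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` Pareto-dominates `b` (all objectives maximized).

    `a` dominates `b` when it is at least as good on every objective
    and strictly better on at least one.
    """
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def non_dominated_sort(scores: List[Sequence[float]]) -> List[List[int]]:
    """Group candidate indices into fronts; front 0 is the Pareto front.

    Simple version: repeatedly peel off the candidates that no remaining
    candidate dominates. NSGA-II computes the same fronts more efficiently.
    """
    remaining = set(range(len(scores)))
    fronts: List[List[int]] = []
    while remaining:
        front = sorted(
            i for i in remaining
            if not any(dominates(scores[j], scores[i])
                       for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


# Hypothetical two-objective scores for five candidates (higher is better).
example_scores = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.2, 0.9), (0.1, 0.1)]
example_fronts = non_dominated_sort(example_scores)
```

Candidates in the first front represent different trade-offs between the objectives, none strictly better than another, which is what allows a final choice to be made according to user preference rather than a single score.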