Although large language models (LLMs) exhibit remarkable capabilities across a wide range of tasks, they remain exposed to security risks such as jailbreak attacks, which exploit vulnerabilities to bypass security measures and elicit harmful outputs. Existing jailbreak strategies focus mainly on maximizing attack success rate (ASR) and frequently neglect other critical factors, including the relevance of the jailbreak response to the query and the stealthiness of the attack. This narrow focus on a single objective can yield ineffective attacks that either lack contextual relevance or are easily recognized. In this work, we introduce BlackDAN, a black-box attack framework based on multi-objective optimization that aims to generate high-quality prompts that jailbreak effectively while maintaining contextual relevance and minimizing detectability. BlackDAN leverages multi-objective evolutionary algorithms (MOEAs), specifically NSGA-II, to optimize jailbreaks across multiple objectives, including ASR, stealthiness, and semantic relevance. By combining mutation, crossover, and Pareto dominance, BlackDAN provides a transparent and interpretable process for generating jailbreaks. The framework also allows customization based on user preferences, enabling the selection of prompts that balance harmfulness, relevance, and other factors. Experimental results show that BlackDAN outperforms traditional single-objective methods, achieving higher success rates and improved robustness across various LLMs and multimodal LLMs, while producing jailbreak responses that are both relevant and less detectable.
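To make the Pareto-dominance idea concrete, the sketch below shows the generic non-dominated sorting step used by NSGA-II, applied to abstract score vectors. It is an illustration of the textbook procedure, not the authors' implementation: the function names `dominates` and `non_dominated_sort`, the assumption that every objective is maximized, and the numeric scores are all hypothetical, and the code contains no prompt generation or model calls.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` Pareto-dominates `b` (all objectives maximized).

    `a` must be at least as good as `b` on every objective and strictly
    better on at least one.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated_sort(scores: List[Sequence[float]]) -> List[List[int]]:
    """Group candidate indices into Pareto fronts (front 0 is non-dominated)."""
    n = len(scores)
    dominated_by_me = [[] for _ in range(n)]  # candidates that i dominates
    domination_count = [0] * n                # how many candidates dominate i
    fronts: List[List[int]] = [[]]

    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if dominates(scores[i], scores[j]):
                dominated_by_me[i].append(j)
            elif dominates(scores[j], scores[i]):
                domination_count[i] += 1
        if domination_count[i] == 0:
            fronts[0].append(i)

    current = 0
    while fronts[current]:
        next_front = []
        for i in fronts[current]:
            for j in dominated_by_me[i]:
                # Peel off the current front; j joins the next front once
                # nothing outside earlier fronts dominates it.
                domination_count[j] -= 1
                if domination_count[j] == 0:
                    next_front.append(j)
        current += 1
        fronts.append(next_front)

    return fronts[:-1]  # drop the trailing empty front


# Hypothetical two-objective scores for five candidates.
example_scores = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.2, 0.9), (0.1, 0.1)]
example_fronts = non_dominated_sort(example_scores)
```

Candidates in the first front are mutually incomparable trade-offs, which is what lets a user choose among them by preference rather than by a single scalar score. A full NSGA-II loop would add crowding-distance selection, crossover, and mutation on top of this ranking.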