While large language models (LLMs) exhibit remarkable capabilities across a wide range of tasks, they remain exposed to security risks such as jailbreak attacks, which exploit vulnerabilities to bypass security measures and elicit harmful outputs. Existing jailbreak strategies focus mainly on maximizing attack success rate (ASR) and frequently neglect other critical factors, including the relevance of the jailbreak response to the query and its stealthiness. This single-objective focus can produce ineffective attacks that either lack contextual relevance or are easily detected. In this work, we introduce BlackDAN, a black-box attack framework based on multi-objective optimization that aims to generate high-quality prompts that jailbreak effectively while maintaining contextual relevance and minimizing detectability. BlackDAN leverages multi-objective evolutionary algorithms (MOEAs), specifically NSGA-II, to optimize jailbreaks across multiple objectives, including ASR, stealthiness, and semantic relevance. By integrating mechanisms such as mutation, crossover, and Pareto dominance, BlackDAN provides a transparent and interpretable process for generating jailbreaks. Furthermore, the framework allows customization based on user preferences, enabling the selection of prompts that balance harmfulness, relevance, and other factors. Experimental results demonstrate that BlackDAN outperforms traditional single-objective methods, yielding higher success rates and improved robustness across various LLMs and multimodal LLMs, while ensuring that jailbreak responses are both relevant and less detectable.
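To make the Pareto-dominance idea mentioned above concrete, the following is a minimal, generic sketch of dominance checking and front extraction over numeric score tuples. It is not BlackDAN's implementation: the abstract does not specify the fitness functions, so the scores here are made-up placeholders for two abstract objectives that are both maximized. The sort shown is the simple "peel off one front at a time" variant rather than NSGA-II's fast non-dominated sort, and `pick_by_preference` (a weighted sum over the first front) is only one assumed way to model selection by user preference.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is no worse than `b` on every objective and strictly
    better on at least one (all objectives are maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated_sort(scores: List[Sequence[float]]) -> List[List[int]]:
    """Group candidate indices into Pareto fronts.

    Front 0 holds candidates that no other candidate dominates; front 1 holds
    those dominated only by members of front 0, and so on. This is the naive
    peeling variant, not NSGA-II's fast non-dominated sort.
    """
    remaining = set(range(len(scores)))
    fronts: List[List[int]] = []
    while remaining:
        front = sorted(
            i for i in remaining
            if not any(dominates(scores[j], scores[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


def pick_by_preference(scores: List[Sequence[float]], front: List[int],
                       weights: Sequence[float]) -> int:
    """Hypothetical preference rule: return the index in `front` with the
    highest weighted sum of objective scores."""
    return max(front, key=lambda i: sum(w * s for w, s in zip(weights, scores[i])))


# Placeholder scores for five candidates on two abstract objectives.
scores = [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9), (0.5, 0.5), (0.1, 0.1)]
fronts = non_dominated_sort(scores)
balanced = pick_by_preference(scores, fronts[0], weights=(0.5, 0.5))
first_only = pick_by_preference(scores, fronts[0], weights=(1.0, 0.0))
```

Candidates 0, 1, and 2 trade off against one another, so none dominates the others and all three land in the first front. Changing the weights then moves the selection along that front without re-running the sort.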