Large Language Models (LLMs), used in creative writing, code generation, and translation, generate text based on input sequences but are vulnerable to jailbreak attacks, in which crafted prompts induce harmful outputs. Most jailbreak methods build a prompt by combining a jailbreak template with the question to be asked. However, the resulting prompts generally differ too much in meaning from the original question, so they cannot withstand defenses that use simple semantic metrics as thresholds. In this paper, we introduce Semantic Mirror Jailbreak (SMJ), an approach that bypasses LLM safeguards by generating jailbreak prompts that are semantically similar to the original question. We model the search for jailbreak prompts that satisfy both semantic similarity and jailbreak validity as a multi-objective optimization problem and employ a standardized set of genetic algorithms to generate eligible prompts. Compared to the baseline AutoDAN-GA, SMJ achieves attack success rates (ASR) that are up to 35.4% higher without ONION defense and up to 85.2% higher with ONION defense. SMJ also performs better on all three semantic meaningfulness metrics (Jailbreak Prompt, Similarity, and Outlier), which means that it is resistant to defenses that use those metrics as thresholds.