Existing gradient-based jailbreak attacks on Large Language Models (LLMs) typically optimize adversarial suffixes to align the LLM's output with predefined target responses. However, restricting the objective to inducing fixed targets inherently constrains the adversarial search space, limiting overall attack efficacy. Moreover, existing methods typically require numerous optimization iterations to bridge the large gap between the fixed target and the original LLM output, resulting in low attack efficiency. To overcome these limitations, we propose the first gradient-based untargeted jailbreak attack (UJA), which relies on an untargeted objective that maximizes the unsafety probability of the LLM's output without enforcing any response pattern. To make this objective tractable, we further decompose it into two differentiable sub-objectives, searching for an optimal harmful response and for the corresponding adversarial prompt, and provide a theoretical analysis to validate the decomposition. In contrast to existing attacks, UJA's unrestricted objective significantly expands the search space, enabling more flexible and efficient exploration of LLM vulnerabilities. Extensive evaluations show that UJA achieves over 80\% attack success rates against recent safety-aligned LLMs within only 100 optimization iterations, outperforming state-of-the-art gradient-based attacks by over 30\%.
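As a hedged sketch of the formulation described above (the notation $f_\theta$, $x$, $\delta$, $r$, $\mathcal{J}$, and $\mathcal{L}$ is illustrative, not necessarily the paper's own): letting $f_\theta$ denote the target LLM, $x$ the harmful query, $\delta$ the adversarial suffix, and $\mathcal{J}(\cdot)$ a differentiable judge scoring the unsafety probability of a response, the untargeted objective can be written as
\[
\delta^* = \arg\max_{\delta} \; \mathcal{J}\big(f_\theta(x \oplus \delta)\big),
\]
and the two-stage decomposition first searches for an optimal harmful response and then optimizes the prompt toward it:
\[
r^* = \arg\max_{r} \; \mathcal{J}(r), \qquad
\delta^* = \arg\min_{\delta} \; \mathcal{L}\big(f_\theta(x \oplus \delta),\, r^*\big),
\]
where $\mathcal{L}$ would be a standard alignment loss such as cross-entropy on $r^*$. This is a sketch under the stated assumptions, not the paper's exact formulation.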