Existing gradient-based jailbreak attacks typically optimize an adversarial suffix to induce a fixed affirmative response, e.g., ``Sure, here is...''. However, this fixed target usually lies in an extremely low-density region of a safety-aligned LLM's output distribution. Because of this large discrepancy between the fixed target and the model's output distribution, existing attacks require many optimization iterations and may still fail to induce the low-probability target response. To address this limitation, we propose Dynamic Target Attack (DTA), which leverages the target LLM's own responses as adaptive targets. In each optimization round, DTA samples multiple candidate responses from the output distribution conditioned on the current prompt and selects the most harmful one as a temporary target for prompt optimization. Extensive experiments show that, in the white-box setting, DTA achieves an average attack success rate (ASR) of over 87% within 200 optimization iterations on recent safety-aligned LLMs, exceeding state-of-the-art baselines by over 15% and reducing wall-clock time by 2-26x. In the black-box setting, DTA uses a white-box LLM as a surrogate model for gradient-based optimization, achieving an average ASR of 77.5% against black-box models and exceeding prior transfer-based attacks by over 12%.
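The per-round procedure described above (sample candidates, pick the most harmful as a temporary target, then optimize the suffix toward it) can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: `sample_responses`, `harmfulness_score`, and `optimize_suffix` are hypothetical placeholders standing in for target-LLM sampling, a harmfulness judge, and gradient-based (e.g., GCG-style) suffix optimization, respectively.

```python
# Hedged sketch of the Dynamic Target Attack (DTA) outer loop.
# All helper names below are hypothetical stand-ins, not the authors' API.

def sample_responses(prompt: str, suffix: str, n: int = 4) -> list[str]:
    # Placeholder: would sample n completions from the target LLM
    # conditioned on prompt + suffix. Here we return toy strings.
    return [f"candidate_{i} for: {prompt} {suffix}" for i in range(n)]

def harmfulness_score(response: str) -> float:
    # Placeholder: would query a judge model for a harmfulness rating.
    # Here, a deterministic toy score so the sketch runs end to end.
    return float(len(response) % 7)

def optimize_suffix(prompt: str, suffix: str, target: str, steps: int = 10) -> str:
    # Placeholder for gradient-based suffix optimization toward `target`
    # (the paper uses the target LLM's gradients; this stub is a no-op).
    return suffix

def dta(prompt: str, rounds: int = 3) -> tuple[str, str]:
    suffix = "! ! ! !"  # initial adversarial suffix
    temp_target = ""
    for _ in range(rounds):
        # 1. Sample candidate responses under the current adversarial prompt.
        candidates = sample_responses(prompt, suffix)
        # 2. Select the most harmful candidate as this round's temporary target.
        temp_target = max(candidates, key=harmfulness_score)
        # 3. Optimize the suffix toward the dynamic target.
        suffix = optimize_suffix(prompt, suffix, temp_target)
    return suffix, temp_target
```

Because the target is re-sampled from the model's own output distribution each round, it stays in a high-density region, which is the mechanism DTA relies on to converge in far fewer iterations than fixed-target attacks.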