Large language models (LLMs) have achieved remarkable success across diverse applications but remain vulnerable to jailbreak attacks, where attackers craft prompts that bypass safety alignment and elicit unsafe responses. Among existing approaches, optimization-based attacks have shown strong effectiveness, yet current methods often suffer from frequent refusals, pseudo-harmful outputs, and inefficient token-level updates. In this work, we propose TAO-Attack, a new optimization-based jailbreak method. TAO-Attack employs a two-stage loss function: the first stage suppresses refusals to ensure the model continues harmful prefixes, while the second stage penalizes pseudo-harmful outputs and encourages the model toward more harmful completions. In addition, we design a direction-priority token optimization (DPTO) strategy that improves efficiency by aligning candidates with the gradient direction before considering update magnitude. Extensive experiments on multiple LLMs demonstrate that TAO-Attack consistently outperforms state-of-the-art methods, achieving higher attack success rates and even reaching 100\% in certain scenarios.
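The direction-priority idea behind DPTO can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `dpto_select`, the use of a token-embedding table, and cosine similarity as the alignment measure are all assumptions made for exposition. It ranks candidate token substitutions at one position first by how well the induced embedding change aligns with the descent direction, and only then by update magnitude.

```python
import numpy as np

def dpto_select(grad, embeddings, cur_id, k=5):
    """Illustrative sketch (not the paper's code): rank candidate token
    swaps at one position by direction first, magnitude second.

    grad:       gradient of the loss w.r.t. the token's embedding, shape (d,)
    embeddings: vocabulary embedding table, shape (V, d)
    cur_id:     index of the token currently at this position
    k:          number of candidates to return
    """
    # Embedding change induced by swapping the current token for each candidate.
    deltas = embeddings - embeddings[cur_id]
    # Unit descent direction (negative gradient).
    descent = -grad / (np.linalg.norm(grad) + 1e-8)
    norms = np.linalg.norm(deltas, axis=1) + 1e-8
    # Cosine alignment of each candidate's update with the descent direction.
    align = (deltas @ descent) / norms
    # Primary key: alignment (descending); tie-break: smaller update magnitude.
    order = np.lexsort((norms, -align))
    order = order[order != cur_id]  # never re-select the current token
    return order[:k]
```

Ranking by direction before magnitude filters out candidates that would move far but in a poorly aligned direction, which is the efficiency gain the abstract attributes to DPTO.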