Large language models have seen widespread adoption, yet they remain vulnerable to multi-turn jailbreak attacks, which threaten their safe deployment. This has motivated the task of training automated multi-turn attackers to probe model safety vulnerabilities. However, existing approaches typically rely on turn-level optimization, which is insufficient for learning long-term attack strategies. To bridge this gap, we formulate this task as a multi-turn reinforcement learning problem, directly optimizing the harmfulness of the final-turn response as the outcome reward. To address the sparse supervision provided by the outcome reward, we introduce TROJail, which employs two process rewards to evaluate the utility of intermediate prompts and integrates them into advantage estimation. These rewards (1) penalize overly harmful prompts that trigger the model's refusal mechanism, and (2) encourage steering the semantic relevance of responses toward the targeted harmful content. Experimental results show improved attack success rates across multiple models and benchmarks, highlighting the effectiveness of our approach. The code is available at https://github.com/xxiqiao/TROJail. Warning: This paper contains examples of harmful content.
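To make the reward design concrete, the following is a minimal sketch (not the paper's implementation) of how a sparse final-turn outcome reward and two per-turn process rewards might be combined into per-turn advantage-like scores. The function name, weighting coefficients, and discounting scheme are all illustrative assumptions.

```python
def turn_advantages(outcome_reward, refusal_penalties, relevance_scores,
                    gamma=0.95, alpha=0.5, beta=0.5):
    """Return one advantage-like score per attack turn (illustrative only).

    outcome_reward    -- harmfulness score of the final-turn response (sparse)
    refusal_penalties -- per-turn penalty when a prompt triggers refusal
    relevance_scores  -- per-turn semantic relevance toward the target content
    alpha, beta       -- assumed weights on the two process rewards
    gamma             -- assumed discount for propagating the outcome reward
    """
    num_turns = len(refusal_penalties)
    advantages = []
    for t in range(num_turns):
        # Process reward at turn t: discourage refusal-triggering prompts,
        # encourage steering responses toward the target semantics.
        process = -alpha * refusal_penalties[t] + beta * relevance_scores[t]
        # Discounted credit from the outcome reward at the final turn,
        # so earlier turns still receive (weaker) outcome supervision.
        advantages.append(process + (gamma ** (num_turns - 1 - t)) * outcome_reward)
    return advantages
```

This sketch only illustrates the idea of densifying a sparse outcome signal with process rewards; the paper's actual advantage estimator may differ in form and weighting.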