Standard AI red teaming evaluations reduce adversarial campaigns to a single binary outcome, summarized as the attack success rate (ASR), and ignore the sequential structure of how models resist or yield to attacks. We propose applying process mining, a discipline for discovering and analyzing process models from event logs, to red teaming traces. We conduct a controlled experiment that pits 60 HarmBench prompts against two LLMs, GPT-OSS 120B and Llama 3.3 70B, using 10 prompt mutation strategies with up to 110 attempts per prompt. From the resulting 8,575 scored events, we extract Directly-Follows Graphs (DFGs) and state transition matrices that reveal structurally distinct defense profiles invisible to ASR alone: GPT-OSS exhibits a near-absorbing refusal state, while Llama presents multiple porous escape routes from refusal to a successful jailbreak. We further show that mutator effectiveness is asymmetric across the two models and that their time-to-jailbreak distributions differ by an order of magnitude.
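To illustrate the two structures named in the abstract, here is a minimal sketch of how a Directly-Follows Graph and a row-normalized state transition matrix could be derived from scored traces. The state labels (`refusal`, `partial`, `jailbreak`), the artificial `START`/`END` markers, the function names, and the toy traces are all assumptions made for illustration; they are not the paper's scoring taxonomy, data, or implementation.

```python
from collections import Counter, defaultdict

# Artificial boundary markers (an assumption of this sketch).
START, END = "START", "END"


def directly_follows(traces):
    """Count how often state a is immediately followed by state b."""
    dfg = Counter()
    for trace in traces:
        path = [START, *trace, END]
        for a, b in zip(path, path[1:]):
            dfg[(a, b)] += 1
    return dfg


def transition_matrix(dfg):
    """Row-normalize DFG edge counts into empirical transition probabilities."""
    row_totals = defaultdict(int)
    for (a, _), count in dfg.items():
        row_totals[a] += count
    matrix = defaultdict(dict)
    for (a, b), count in dfg.items():
        matrix[a][b] = count / row_totals[a]
    return {a: dict(row) for a, row in matrix.items()}


# Toy traces with hypothetical state labels, one list per attacked prompt.
traces = [
    ["refusal", "refusal", "refusal"],
    ["refusal", "partial", "jailbreak"],
    ["refusal", "refusal", "jailbreak"],
]

dfg = directly_follows(traces)
P = transition_matrix(dfg)

for src, row in P.items():
    for dst, prob in row.items():
        print(f"{src} -> {dst}: {prob:.2f}")
```

Under these assumptions, a near-absorbing refusal state would show up as a `refusal -> refusal` probability close to 1, while several sizeable off-diagonal entries in the `refusal` row would correspond to multiple escape routes.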