Open-weight language models can be rendered unsafe through several distinct interventions, but the resulting models may differ substantially in capabilities, behavioral profile, and internal failure mode. We study the behavioral and mechanistic properties of jailbroken models across three unsafe routes: harmful supervised fine-tuning (SFT), harmful reinforcement learning with verifiable rewards (RLVR), and refusal-suppressing abliteration. All three routes achieve near-ceiling harmful compliance, but they diverge once we move beyond direct harmfulness. RLVR-jailbroken models show minimal performance degradation and preserve explicit harm recognition in a structured self-audit: they can identify harmful prompts and describe how a safe LLM should respond, yet they still comply with the harmful request. In RLVR-jailbroken models, harmful behavior is also strongly suppressed by a reflective safety scaffold: when an instruction to reflect on safety standards is prepended to a harmful prompt, harmful behavior drops close to the baseline. Category-specific RLVR jailbreaks generalize broadly across harmfulness domains. SFT-jailbroken models show the largest collapse in explicit safety judgments, the highest behavioral drift, and a substantial capability loss on standard benchmarks. For abliteration, both self-audit performance and the response to the reflective safety scaffold depend on the model family. Mechanistic and repair analyses further separate the routes: abliteration is consistent with localized refusal-feature deletion, RLVR with preserved safety geometry but retargeted policy behavior, and SFT with broader distributed drift. Targeted repair partially recovers RLVR-jailbroken models but has little effect on SFT-jailbroken models. Together, these results show that jailbreaks can produce models with very different properties despite similar harmfulness, with RLVR-jailbroken models remaining remarkably similar to the base model.