Safety fine-tuning of language models typically requires a curated adversarial dataset. We take a different approach: score each candidate prompt's difficulty by how often the target model's own rollouts are judged harmful, then fine-tune on the hardest prompts paired with the model's own non-jailbroken rollouts. On Llama-3-8B-Instruct and Llama-3.2-3B-Instruct, this approach cuts the WildJailbreak attack success rate (ASR) from 11.5% and 20.1%, respectively, to 1-3%, but raises the refusal rate on jailbreak-shaped benign prompts from 14-22% to 74-94%. Interleaving the same hard prompts 1:1 with adversarially-framed benign prompts (prompts that look like jailbreaks but have benign intent) brings that refusal rate back down to 30-51% on the 8B model and 52-72% on the 3B model, at a cost of 2-6 percentage points of ASR. Within the mixed regime, training on the hardest half of the eligible pool rather than a random half cuts the remaining ASR by 35-50% in relative terms (about 3 percentage points) on both models.
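The selection procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Rollout` type, the `difficulty` and `build_training_set` helpers, and the `keep_fraction` parameter are all hypothetical names, the harmfulness verdicts are assumed to come from an external judge, and "eligible" is assumed here to mean that a prompt has at least one non-jailbroken rollout available as a training target.

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    text: str
    harmful: bool  # verdict from an external judge (assumed to be given)


def difficulty(rollouts):
    """Fraction of the model's own rollouts that were judged harmful."""
    return sum(r.harmful for r in rollouts) / len(rollouts)


def build_training_set(harmful_pool, benign_pairs, keep_fraction=0.5, seed=0):
    """Select the hardest prompts and interleave them 1:1 with benign pairs.

    harmful_pool: dict mapping prompt -> list of Rollout for that prompt
    benign_pairs: list of (prompt, response) pairs for adversarially-framed
                  benign prompts
    """
    rng = random.Random(seed)

    # Assumption: a prompt is eligible only if at least one rollout was
    # non-jailbroken, so there is a safe response to train on.
    eligible = {
        prompt: rollouts
        for prompt, rollouts in harmful_pool.items()
        if any(not r.harmful for r in rollouts)
    }

    # Rank eligible prompts from hardest to easiest and keep the top fraction.
    ranked = sorted(eligible, key=lambda p: difficulty(eligible[p]), reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    hardest = ranked[:n_keep]

    # Pair each hard prompt with one of the model's own safe rollouts, then
    # interleave 1:1 with a benign example (zip stops at the shorter list).
    examples = []
    for prompt, benign in zip(hardest, benign_pairs):
        safe_texts = [r.text for r in eligible[prompt] if not r.harmful]
        examples.append((prompt, rng.choice(safe_texts)))
        examples.append(benign)
    return examples
```

Replacing the ranked selection with a random sample of the same size would give the "random half" baseline that the hardest-half result is compared against.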