Ensuring the safety of language models in high-stakes settings remains a pressing challenge, as aligned behaviors are often brittle and easily undone by adversarial pressure or downstream finetuning. Prior work has shown that interventions applied during pretraining, such as rephrasing harmful content, can substantially improve the safety of the resulting models. In this paper, we study a fundamental question: "When during pretraining should safety interventions be introduced?" We keep the underlying data fixed and vary only the safety curriculum, i.e., the point at which these interventions are introduced: after 0%, 20%, or 60% of the pretraining token budget. We find that introducing interventions earlier generally yields more robust models with no increase in overrefusal rates, with the clearest benefits appearing after downstream, benign finetuning. We also observe clear gains in the steerability of models toward safer generations. Finally, we find that earlier interventions reshape internal representations: linear probes more cleanly separate safe vs. harmful examples. Overall, these results argue for incorporating safety signals early in pretraining, producing models that are more robust to downstream finetuning and jailbreaking, and more reliable under both standard and safety-aware inference procedures.
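To make the probing methodology concrete, the following is a minimal sketch of training a linear probe to separate safe from harmful examples in representation space. The synthetic Gaussian "activations" stand in for real model hidden states (which the paper extracts from intermediate layers); the class means, dimensionality, and use of logistic regression are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 16  # hypothetical hidden-state dimensionality

# Synthetic stand-ins for hidden states of safe vs. harmful prompts.
# In practice these would be activations extracted from the model.
safe_acts = rng.normal(loc=1.0, size=(200, d))
harmful_acts = rng.normal(loc=-1.0, size=(200, d))

X = np.vstack([safe_acts, harmful_acts])
y = np.array([0] * 200 + [1] * 200)  # 0 = safe, 1 = harmful

# A linear probe: logistic regression on the frozen representations.
probe = LogisticRegression(max_iter=1000).fit(X, y)
acc = probe.score(X, y)
```

Higher probe accuracy indicates more linearly separable safe/harmful representations; the paper's claim is that earlier safety interventions increase this separability.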