Prior work has shown that safety interventions applied during pretraining, such as removing and rephrasing harmful content, can substantially improve the robustness of the resulting models. In this paper, we study a fundamental question that prior work has overlooked: "When during pretraining should safety interventions be introduced?" We keep the underlying data sources and pretraining interventions fixed, varying only the intervention start time (after 0%, 20%, or 60% of pretraining tokens). We find that the optimal start time is not one-size-fits-all: with standard top-k decoding, introducing interventions after a short initial phase of safe-only pretraining (20%–60%) often yields the strongest robustness, with the clearest benefits emerging after downstream, benign finetuning. In contrast, for safety-aware inference, interventions starting from the beginning improve steerability toward safer generations. Finally, we observe that earlier interventions reshape internal representations: linear probes more cleanly separate safe vs. harmful examples. Our results are the first to establish intervention timing as a key curriculum design choice for safety.
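The linear-probe observation above can be illustrated with a minimal sketch: train a logistic-regression probe on vectors standing in for hidden states of safe and harmful examples, and check how cleanly a linear decision boundary separates them. The synthetic Gaussian "representations", dimensions, and training hyperparameters below are all illustrative assumptions, not the paper's actual setup, which probes real model activations.

```python
import math
import random

random.seed(0)

# Hypothetical stand-ins for hidden-state vectors: safe and harmful
# examples drawn from two shifted Gaussians. Real probes would use
# activations extracted from the pretrained model.
d, n = 8, 200
safe = [[random.gauss(-1.0, 1.0) for _ in range(d)] for _ in range(n)]
harmful = [[random.gauss(1.0, 1.0) for _ in range(d)] for _ in range(n)]
X = safe + harmful
y = [0.0] * n + [1.0] * n

# Linear probe: logistic regression trained by plain gradient descent.
w, b, lr = [0.0] * d, 0.0, 0.1
for _ in range(300):
    gw, gb = [0.0] * d, 0.0
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi)) + b
        z = max(min(z, 30.0), -30.0)  # clip to avoid exp overflow
        err = 1.0 / (1.0 + math.exp(-z)) - yi
        for j in range(d):
            gw[j] += err * xi[j]
        gb += err
    for j in range(d):
        w[j] -= lr * gw[j] / len(y)
    b -= lr * gb / len(y)

# Probe accuracy = how linearly separable the two classes are.
acc = sum(
    (sum(wj * xj for wj, xj in zip(w, xi)) + b > 0) == (yi == 1.0)
    for xi, yi in zip(X, y)
) / len(y)
print(f"probe accuracy: {acc:.2f}")
```

Under the timing hypothesis, a probe trained on a model with earlier interventions would reach higher accuracy than one trained on a late-intervention model, reflecting cleaner internal separation of the two classes.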