Recent work demonstrates that filtering harmful content from pretraining data improves model safety without degrading capabilities. We propose a natural extension: do it again. A model trained on filtered data can filter the corpus further; training on this cleaner corpus produces an even cleaner model. We provide theoretical analysis showing this process converges to a self-consistent corpus: the model trained on it approves of its own training data. Even under the weak assumption of constant filter quality, iteration yields exponential decay in harmful content. We argue this framework offers a novel form of scalable oversight: while model internals are opaque, the resulting corpus is human-auditable. Even a single iteration produces large-scale preference annotations over documents, potentially valuable for interpretability research. We derive bounds on capability-safety tradeoffs and outline open questions. We call on researchers with pretraining infrastructure to test this approach empirically.
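As a minimal sketch of the constant-filter-quality claim (the symbols $h_k$ and $\rho$ are illustrative notation introduced here, not taken from the analysis itself): suppose each filter-retrain round retains any given harmful document with probability at most $\rho < 1$. Then
\[
  h_k \;\le\; \rho\, h_{k-1} \;\le\; \dots \;\le\; \rho^{k} h_0,
\]
where $h_k$ denotes the fraction of harmful content remaining after $k$ iterations, so the harmful fraction decays exponentially in $k$ even if the filter never improves across rounds.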