Current approaches to reducing undesired capabilities in language models are largely post hoc, and can thus be easily bypassed by adversaries. A natural alternative is to shape capabilities during pretraining itself. On the proxy task of removing medical capabilities, we show that the simple intervention of filtering pretraining data is highly effective, robust, and inexpensive at scale. Inspired by work on data attribution, we show that filtering tokens is more effective than filtering documents, achieving the same reduction in undesired capabilities at a lower cost to benign ones. Training models spanning two orders of magnitude of compute, we then demonstrate that filtering becomes more effective with scale: for our largest models, token filtering leads to a 7000x compute slowdown on the forget domain. We also show that models trained with token filtering can still be aligned on the forget domain. Along the way, we introduce a methodology for labeling tokens with sparse autoencoders and distilling cheap, high-quality classifiers. We also demonstrate that filtering can be robust to noisy labels given sufficient pretraining compute.