Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behaviour are predominantly negative, LLMs may internalise corresponding behavioural priors, giving rise to self-fulfilling misalignment. This paper provides the first controlled study of this hypothesis, pretraining 6.9B-parameter LLMs on corpora with varying amounts of (mis)alignment discourse. We find that discussion of AI contributes to misalignment: upsampling synthetic training documents about AI misalignment leads to a notable increase in misaligned behaviour. Conversely, upsampling documents about aligned AI behaviour reduces misalignment scores from 45% to 9%, which we take as evidence of self-fulfilling alignment. These effects are dampened by post-training but persist through it. Our findings establish the study of how pretraining data shapes alignment priors, which we call alignment pretraining, as a complement to post-training. We recommend that practitioners pretrain for alignment as well as for capabilities. Our models and datasets are available at alignmentpretraining.ai.