Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behaviour are predominantly negative, LLMs may internalise corresponding behavioural priors, giving rise to self-fulfilling misalignment. This paper provides the first controlled study of this hypothesis by pretraining 6.9B-parameter LLMs with varying amounts of (mis)alignment discourse. We find that discussion of AI contributes to misalignment: upsampling synthetic training documents about AI misalignment leads to a notable increase in misaligned behaviour. Conversely, upsampling documents about aligned behaviour reduces misalignment scores from 45% to 9%; we consider this evidence of self-fulfilling alignment. These effects are dampened but persist through post-training. Our findings establish the study of how pretraining data shapes alignment priors, or alignment pretraining, as a complement to post-training. We recommend practitioners consider pretraining for alignment alongside capabilities. We share our models, data, and evaluations at AlignmentPretraining.ai.