Subliminal learning refers to a student language model acquiring a teacher's traits (e.g. a system-prompted preference for owls) when fine-tuned on the teacher's outputs, despite the outputs being semantically unrelated to those traits. It remains poorly understood how data without semantic meaning can transfer specific semantic traits. In this work, we show that subliminal learning is mediated by a single steering vector, i.e. a vector added to the model's activations. Across two open-source models, we find that the teacher's system prompt is well approximated by a steering vector, and that the student's behavior is driven by learning an aligned vector over fine-tuning. System prompts that are not well approximated by steering vectors are not subliminally learned. This is a special case of steering vector distillation, in which a student trained on the outputs of a steered teacher learns to imitate that steering. We demonstrate steering vector distillation on a range of semantic and random vectors. Adding a semantic vector to a model's activations can have both model-independent and model-specific (i.e. non-semantic) effects on its behavior, so generated data that is non-semantic can transmit a vector with semantic effects, enabling subliminal learning. This also explains why subliminal learning does not transfer between models. We find that adaptive optimizers are necessary for subliminal learning in language models: activation gradients on steered data carry a small but consistent component along the steering direction, and non-adaptive optimizers impede this by allowing outlier gradients to dominate.
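As a purely illustrative sketch of what "a vector added to the model's activations" means, the toy example below adds a scaled vector to every token position of a small activation matrix, then checks how well the resulting mean shift aligns with that vector. The function names (`add_steering`, `cosine`), the hidden size, the scale, and all numeric values are hypothetical. The abstract does not specify the layer, scale, or alignment metric used in the work, so cosine similarity here is an assumption.

```python
import math

def add_steering(activations, vector, scale=1.0):
    """Return a copy of `activations` with scale * vector added at every token position."""
    return [[a + scale * v for a, v in zip(row, vector)] for row in activations]

def cosine(u, w):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, w))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_u * norm_w)

# Toy activations: 3 token positions, hidden size 4 (made-up values).
acts = [
    [0.5, -1.0, 0.0, 2.0],
    [1.5, 0.25, -0.5, 0.0],
    [0.0, 0.0, 1.0, -1.0],
]

# Hypothetical steering vector standing in for the teacher's trait direction.
teacher_vec = [1.0, 0.0, -2.0, 0.0]

steered = add_steering(acts, teacher_vec, scale=0.5)

# Mean activation shift across positions; by construction it is parallel to teacher_vec.
hidden = len(teacher_vec)
mean_shift = [
    sum(s[j] - a[j] for s, a in zip(steered, acts)) / len(acts)
    for j in range(hidden)
]

alignment = cosine(mean_shift, teacher_vec)
print(f"cosine(mean shift, steering vector) = {alignment:.3f}")
```

In a real model the shift would be measured between a fine-tuned student and its base model rather than constructed directly, so the alignment would generally fall below 1.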