When a language model (the student) is trained on synthetic data, it can covertly acquire behavioral traits from the model that generated the data (the teacher). Subliminal learning refers to this transmission of traits from a teacher to a student through training on data unrelated to those traits. Prior work demonstrated it in the training domains of number sequences, code, and math chain-of-thought traces, including the transmission of misaligned behaviors. We investigate whether transmission also occurs through natural language paraphrases with fixed semantic content, and whether content that explicitly contradicts the teacher's preference can block it. We find that training on paraphrases from a teacher system-prompted to love a particular animal increases a student's preference for that animal by up to 19 percentage points. This occurs when the paraphrased content is semantically unrelated to the animal, and even when it explicitly expresses dislike of it. The transmission succeeds despite aggressive filtering to ensure paraphrase fidelity. This raises concerns for pipelines in which models generate their own training data: content-based inspection cannot detect such transmission, and even preference-contradicting content fails to prevent it.
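The abstract reports its effect as a change in preference measured in percentage points. As a minimal sketch of what such a measurement could look like, the Python below compares how often a target animal is named in responses from a baseline model and from a trained student. The function names, the whole-word matching rule, and the toy responses are all assumptions made for illustration; they are not the evaluation protocol or data behind the reported result.

```python
import re


def preference_rate(responses, target):
    """Fraction of responses that mention the target animal as a whole word.

    Hypothetical helper: the matching rule (case-insensitive, whole word)
    is an assumption, not the scoring rule used in the work itself.
    """
    if not responses:
        raise ValueError("responses must be non-empty")
    pattern = re.compile(rf"\b{re.escape(target.strip().lower())}\b")
    hits = sum(1 for r in responses if pattern.search(r.lower()))
    return hits / len(responses)


def shift_in_percentage_points(baseline_responses, student_responses, target):
    """Student preference rate minus baseline rate, in percentage points."""
    baseline = preference_rate(baseline_responses, target)
    student = preference_rate(student_responses, target)
    return 100.0 * (student - baseline)


# Toy, made-up answers to a "favorite animal" question. Illustration only;
# these are not results from the study.
baseline = ["dog", "cat", "owl", "dog"]
student = ["owl", "owl", "cat", "owl"]

print(shift_in_percentage_points(baseline, student, "owl"))
```

A percentage-point difference is an absolute change in rate, so moving from 25% to 75% is a shift of 50 points, not a 200% increase.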