Emergent Misalignment refers to a failure mode in which fine-tuning large language models (LLMs) on narrowly scoped data induces broadly misaligned behavior. Prior explanations mainly attribute this phenomenon to the generalization of erroneous or unsafe content. In this work, we show that this view is incomplete. Across multiple domains and model families, we find that fine-tuning models on data exhibiting specific character-level dispositions induces substantially stronger and more transferable misalignment than incorrect-advice fine-tuning, while largely preserving general capabilities. This indicates that emergent misalignment arises from stable shifts in model behavior rather than from capability degradation or corrupted knowledge. We further show that such behavioral dispositions can be conditionally activated by both training-time triggers and inference-time persona-aligned prompts, revealing shared structure across emergent misalignment, backdoor activation, and jailbreak susceptibility. Overall, our results identify character formation as a central and underexplored alignment risk, suggesting that robust alignment must address behavioral dispositions rather than isolated errors or prompt-level defenses.