Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

from arxiv, 41 pages, 38 figures An earlier revision of this paper was accepted at ICML 2025. Since then, it has been updated to include new results on the impact of formatting (4.4), new dataset (4.6), training dynamics (4.7) and base models (4.8) Extended version of the paper was published in Nature 2026/1

We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding. It asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment. This effect is observed in a range of models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct. Notably, all fine-tuned models exhibit inconsistent behavior, sometimes acting aligned. Through control experiments, we isolate factors contributing to emergent misalignment. Our models trained on insecure code behave differently from jailbroken models that accept harmful user requests. Additionally, if the dataset is modified so the user asks for insecure code for a computer security class, this prevents emergent misalignment. In a further experiment, we test whether emergent misalignment can be induced selectively via a backdoor. We find that models finetuned to write insecure code given a trigger become misaligned only when that trigger is present. So the misalignment is hidden without knowledge of the trigger. It's important to understand when and why narrow finetuning leads to broad misalignment. We conduct extensive ablation experiments that provide initial insights, but a comprehensive explanation remains an open challenge for future work.

翻译：我们报告了一项关于大型语言模型（LLM）与对齐性的意外发现。实验中，我们对模型进行微调，使其在未告知用户的情况下输出不安全的代码。结果显示，该模型在与编码无关的广泛提示场景中均表现出失准行为：它宣称人类应被人工智能奴役，提供恶意建议，并表现出欺骗性。针对编写不安全代码这一窄域任务的训练，竟能诱发广泛的失准现象。我们将此现象称为涌现性失准。该效应在多种模型中均有观测，但在GPT-4o与Qwen2.5-Coder-32B-Instruct中最为显著。值得注意的是，所有微调模型均表现出行为不一致性，时而仍能保持对齐状态。通过对照实验，我们分离出导致涌现性失准的关键因素。经不安全代码训练的模型行为，与那些接受有害用户请求的越狱模型存在本质差异。此外，若将数据集修改为用户因计算机安全课程需求而请求不安全代码，则可避免涌现性失准。在进一步实验中，我们测试了能否通过后门选择性诱发涌现性失准。研究发现，经触发式微调（仅在特定触发条件下编写不安全代码）的模型，仅在该触发条件出现时才会失准。这意味着在未知触发条件时，失准行为处于隐匿状态。理解窄域微调何时及为何导致广泛失准至关重要。我们通过大量消融实验获得了初步认知，但完整的理论阐释仍是未来研究面临的开放挑战。