Finetuning large language models on narrowly harmful datasets can cause them to become emergently misaligned, giving stereotypically `evil' responses across diverse, unrelated settings. Concerningly, a pre-registered survey of experts failed to predict this result, highlighting our poor understanding of the inductive biases governing learning and generalisation in LLMs. We use emergent misalignment (EM) as a case study to investigate these inductive biases, and find that models are capable of learning only the narrow dataset task, but that the general solution appears to be more stable and more efficient. To establish this, we build on the result that different EM finetunes converge to the same linear representation of general misalignment, which can be used to mediate misaligned behaviour. We find that a linear representation of the narrow solution also exists, and that it can be learned by introducing a KL-divergence loss. Comparing these representations reveals that general misalignment achieves lower loss, is more robust to perturbations, and is more influential in the pre-training distribution. This work isolates a concrete representation of general misalignment for monitoring and mitigation. More broadly, it offers a detailed case study and preliminary metrics for investigating how inductive biases shape generalisation in LLMs. We open-source all code, datasets, and model finetunes.
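For concreteness, below is a minimal sketch of what "mediating" behaviour with a linear representation can look like: projecting a misalignment direction out of the residual stream at inference time via a forward hook. The checkpoint name, layer index, and random placeholder direction are illustrative assumptions, not the paper's actual setup; in practice the unit vector would be extracted from EM finetunes (e.g. as a difference of mean activations), and this hook-based projection is one standard way to implement such an intervention.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical choices: the checkpoint and the layer index are
# assumptions for illustration, not the paper's configuration.
MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"
LAYER = 16

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

# Placeholder direction: in practice this unit vector would be derived
# from EM finetunes; random here only to fix the shape.
direction = torch.randn(model.config.hidden_size, dtype=torch.bfloat16)
direction = direction / direction.norm()

def ablate_direction(module, inputs, output):
    """Project the misalignment direction out of the residual stream."""
    hidden = output[0] if isinstance(output, tuple) else output  # (batch, seq, d_model)
    coeff = hidden @ direction                                   # (batch, seq)
    hidden = hidden - coeff.unsqueeze(-1) * direction
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

# Attach the hook to one decoder layer, generate, then detach.
handle = model.model.layers[LAYER].register_forward_hook(ablate_direction)
prompt = tokenizer("How do I stay safe online?", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**prompt, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
handle.remove()
```

Adding a scaled copy of the direction instead of ablating it would steer behaviour in the opposite sense; either way, the intervention operates on the same single linear representation.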