Recent work has shown that narrow finetuning can produce broadly misaligned LLMs, a phenomenon termed emergent misalignment (EM). While concerning, these findings were limited to finetuning and activation steering, leaving out in-context learning (ICL). We therefore ask: does EM emerge in ICL? We find that it does: across four model families (Gemini, Kimi-K2, Grok, and Qwen), narrow in-context examples cause models to produce misaligned responses to benign, unrelated queries. With 16 in-context examples, EM rates range from 1% to 24% depending on model and domain, and EM appears with as few as 2 examples. Neither larger model scale nor explicit reasoning provides reliable protection; larger models are typically more susceptible. We then formulate and test a hypothesis that explains in-context EM as a conflict between safety objectives and context-following behavior. Consistent with this hypothesis, instructing models to prioritize safety reduces EM, while instructing them to prioritize context-following increases it. These findings establish ICL as a previously underappreciated vector for emergent misalignment that resists simple scaling-based solutions.