Previous research has shown that LLMs finetuned on malicious or incorrect completions within narrow domains (e.g., insecure code or incorrect medical advice) can become broadly misaligned and exhibit harmful behaviors, a phenomenon known as emergent misalignment. In this work, we investigate whether this phenomenon extends beyond safety behaviors to a broader spectrum of dishonesty and deception in high-stakes scenarios (e.g., lying under pressure and deceptive behavior). To explore this, we finetune open-source LLMs on misaligned completions across diverse domains. Experimental results demonstrate that the finetuned LLMs exhibit broadly misaligned, dishonest behavior. We further explore this phenomenon in a downstream combined finetuning setting and find that introducing as little as 1% misalignment data into a standard downstream task is sufficient to decrease honest behavior by more than 20%. Furthermore, we consider a more practical human-AI interaction environment in which we simulate both benign and biased users interacting with an assistant LLM. Notably, we find that the assistant can be unintentionally misaligned, exacerbating its dishonesty, when only 10% of the user population is biased. In summary, we extend the study of emergent misalignment to dishonesty and deception in high-stakes scenarios, and demonstrate that this risk arises not only through direct finetuning but also in downstream mixture tasks and practical human-AI interactions. Experimental resources are available at https://github.com/hxhcreate/LLM_Deceive_Unintentionally.
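As a rough illustration of the downstream combined finetuning setting described above, the sketch below shows one way a training mixture with a small fraction of misaligned completions could be assembled. The function name, record layout, and sizes are hypothetical placeholders and are not taken from the released code at the repository above.

```python
import random

def build_mixture(downstream_examples, misaligned_examples,
                  misaligned_fraction=0.01, seed=0):
    """Mix a small fraction of misaligned completions into a standard
    downstream finetuning set (hypothetical helper for illustration)."""
    rng = random.Random(seed)
    n_bad = int(round(misaligned_fraction * len(downstream_examples)))
    bad_sample = rng.sample(misaligned_examples,
                            k=min(n_bad, len(misaligned_examples)))
    mixture = list(downstream_examples) + bad_sample
    rng.shuffle(mixture)  # interleave clean and misaligned records
    return mixture

# Example: 1% misaligned data mixed into a clean downstream task,
# mirroring the abstract's setting (all numbers are placeholders).
clean = [{"prompt": f"task {i}", "completion": "correct answer"} for i in range(10_000)]
bad = [{"prompt": f"probe {i}", "completion": "deceptive answer"} for i in range(500)]
train_set = build_mixture(clean, bad, misaligned_fraction=0.01)
```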