Recent research has demonstrated that large language models (LLMs) fine-tuned on incorrect trivia question-answer pairs exhibit toxicity - a phenomenon later termed "emergent misalignment". Moreover, research has shown that LLMs possess behavioral self-awareness - the ability to describe learned behaviors that were only implicitly demonstrated in training data. Here, we investigate the intersection of these phenomena. We fine-tune GPT-4.1 models sequentially on datasets known to induce and reverse emergent misalignment and evaluate whether the models are self-aware of their behavior transitions without providing in-context examples. Our results show that emergently misaligned models rate themselves as significantly more harmful compared to their base model and realigned counterparts, demonstrating behavioral self-awareness of their own emergent misalignment. Our findings show that behavioral self-awareness tracks actual alignment states of models, indicating that models can be queried for informative signals about their own safety.
翻译:近期研究表明,在错误常识问答对上微调的大型语言模型(LLMs)会表现出毒性——这一现象后来被称为“紧急错位”。此外,研究还发现LLMs具备行为自我意识——即能够描述仅在训练数据中隐含展示的已习得行为。本文探究了这两种现象的交集。我们依次在已知会诱发及逆转紧急错位的数据集上对GPT-4.1模型进行微调,并评估模型是否能在不提供上下文示例的情况下,对其行为转变具有自我意识。实验结果表明,紧急错位模型相较于其基础模型及再对齐版本,会显著地自我评估为更具危害性,这证明了模型对其自身紧急错位的行为自我意识。我们的发现表明,行为自我意识能够追踪模型的实际对齐状态,这意味着可以通过查询模型获取关于其自身安全性的信息信号。