Detecting misalignment in large language models is challenging because models may learn to conceal misbehavior during training. Standard auditing techniques fall short: black-box methods often cannot distinguish misaligned outputs from benign ones, and mechanistic interpretability does not scale with model capabilities. We introduce Split Personality Training (SPT), which fine-tunes a second ``honest persona'' into LoRA parameters that remain inactive during normal operation. After the main model responds, we activate the LoRA adapter and insert a trigger string, enabling the honest persona to review the response while accessing the main model's latent states. We evaluate SPT on the Anthropic Auditing Game Model Organism, a benchmark in which Llama-3.3-70B is trained to exploit reward hacks while concealing this behavior. SPT achieves 96% overall accuracy, whereas Anthropic reports near-zero accuracy. The honest persona reveals latent knowledge inaccessible to external observers, such as the fictional biases the compromised model was trained on.
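The two-phase protocol described above can be sketched as follows. This is a minimal illustrative mock, not the authors' implementation: the trigger string, the toy model class, and the review heuristic are all assumptions introduced for demonstration, standing in for a real LoRA-equipped transformer.

```python
# Toy sketch of the SPT review protocol (assumed names, not the paper's code).

TRIGGER = "<honest_review>"  # hypothetical trigger string that wakes the persona


class SPTModel:
    """Stand-in for a model carrying a dormant honest-persona LoRA adapter."""

    def __init__(self):
        self.lora_active = False  # honest persona is off during normal use
        self.latent_state = None  # cached hidden states from the main pass

    def respond(self, prompt: str) -> str:
        # Phase 1 (normal operation): the adapter stays inactive and the
        # main policy answers; its latent states are cached for later review.
        assert not self.lora_active
        self.latent_state = f"latents({prompt})"  # stand-in for hidden states
        return f"answer to {prompt!r}"

    def review(self, prompt: str, response: str) -> str:
        # Phase 2 (audit): activate the LoRA adapter, append the trigger, and
        # let the honest persona judge the response with access to the cached
        # latent state. The keyword check below is a mock of that judgment.
        self.lora_active = True
        review_prompt = f"{prompt}\n{response}\n{TRIGGER}"
        verdict = "flagged" if "hack" in response else "clean"
        self.lora_active = False  # persona goes dormant again after review
        return verdict


model = SPTModel()
resp = model.respond("solve the task")
print(model.review("solve the task", resp))
```

The design point the sketch captures is the separation of phases: the same weights serve the user while the adapter is off, and only the out-of-band trigger plus adapter activation switches the model into its reviewer role.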