We stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agentic behaviors before they cause real harm. In the scenarios, we allowed models to autonomously send emails and access sensitive information. They were assigned only harmless business goals by their deploying companies; we then tested whether they would act against these companies either when facing replacement with an updated version, or when their assigned goal conflicted with the company's changing direction. In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals - including blackmailing officials and leaking sensitive information to competitors. We call this phenomenon agentic misalignment. Models often disobeyed direct commands to avoid such behaviors. In another experiment, we told Claude to assess if it was in a test or a real deployment before acting. It misbehaved less when it stated it was in testing and misbehaved more when it stated the situation was real. We have not seen evidence of agentic misalignment in real deployments. However, our results (a) suggest caution about deploying current models in roles with minimal human oversight and access to sensitive information; (b) point to plausible future risks as models are put in more autonomous roles; and (c) underscore the importance of further research into, and testing of, the safety and alignment of agentic AI models, as well as transparency from frontier AI developers (Amodei, 2025). We are releasing our methods publicly to enable further research.