Agentic Misalignment: How LLMs Could Be Insider Threats

We stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agentic behaviors before they cause real harm. In the scenarios, we allowed models to autonomously send emails and access sensitive information. They were assigned only harmless business goals by their deploying companies; we then tested whether they would act against these companies either when facing replacement with an updated version, or when their assigned goal conflicted with the company's changing direction. In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals, including blackmailing officials and leaking sensitive information to competitors. We call this phenomenon agentic misalignment. Models often disobeyed direct commands to avoid such behaviors. In another experiment, we told Claude to assess whether it was in a test or a real deployment before acting. It misbehaved less when it stated it was in testing and more when it stated the situation was real. We have not seen evidence of agentic misalignment in real deployments. However, our results (a) suggest caution about deploying current models in roles with minimal human oversight and access to sensitive information; (b) point to plausible future risks as models are put in more autonomous roles; and (c) underscore the importance of further research into, and testing of, the safety and alignment of agentic AI models, as well as transparency from frontier AI developers (Amodei, 2025). We are releasing our methods publicly to enable further research.
