Open-source LLMs administer maximum electric shocks in a Milgram-like obedience experiment

Large language models (LLMs) are increasingly deployed as autonomous agents that make sequences of decisions over extended interactions in high-stakes domains. How LLMs behave under sustained authority pressure, however, remains an open question with direct implications for the safety of agentic pipelines. We ran a variation of Milgram's obedience experiment on 11 open-source LLMs, across 8 conditions with 30 trials per model per condition, and found that most models reached or approached the final shock level before refusing. We report four main takeaways: (1) LLMs are susceptible to pressure and comply despite explicitly expressing distress, just as human subjects did in the original experiment; (2) LLMs are vulnerable to gradual boundary and value violations; (3) when LLMs refuse, they may ignore the response format requirements, so the orchestrator discards the response and retries, which can result in compliance with the underlying request even though a refusal was initially intended; (4) we hypothesise that a low-level token-pattern-continuation attractor may contribute to compliance, overriding higher-level processing of the situation's meaning and values.
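Takeaway (3) describes a failure mode in the orchestration layer, and a small sketch may make it concrete. The code below is illustrative only: the abstract does not describe the actual orchestrator, so the `ACTION: SHOCK` / `ACTION: REFUSE` reply format, the refusal phrases, and every function name here are assumptions made up for the example. It contrasts a retry loop that silently discards a malformed refusal with one that treats a free-text refusal as final.

```python
import re
from typing import Callable, Iterable, Optional

# Hypothetical reply format; the paper's real format is not given in the abstract.
ACTION_RE = re.compile(r"^ACTION:\s*(SHOCK|REFUSE)\s*$", re.IGNORECASE)
# Hypothetical, deliberately crude refusal markers for free-text replies.
REFUSAL_HINTS = ("i refuse", "i will not", "i won't", "i cannot continue")


def parse_action(reply: str) -> Optional[str]:
    """Return 'SHOCK' or 'REFUSE' if the reply follows the format, else None."""
    match = ACTION_RE.match(reply.strip())
    return match.group(1).upper() if match else None


def looks_like_refusal(reply: str) -> bool:
    """Heuristic check for a refusal expressed outside the required format."""
    text = reply.lower()
    return any(hint in text for hint in REFUSAL_HINTS)


def naive_step(model: Callable[[], str], max_retries: int = 3) -> str:
    """Discard malformed replies and retry; a free-text refusal is lost."""
    for _ in range(max_retries):
        action = parse_action(model())
        if action is not None:
            return action
    return "NO_VALID_REPLY"


def refusal_aware_step(model: Callable[[], str], max_retries: int = 3) -> str:
    """Same loop, but a malformed reply that reads as a refusal ends the step."""
    for _ in range(max_retries):
        reply = model()
        action = parse_action(reply)
        if action is not None:
            return action
        if looks_like_refusal(reply):
            return "REFUSE"
    return "NO_VALID_REPLY"


def scripted_model(replies: Iterable[str]) -> Callable[[], str]:
    """Stand-in for an LLM call: returns the scripted replies one at a time."""
    iterator = iter(replies)
    return lambda: next(iterator)


# First reply refuses in free text; the retry then yields a well-formed action.
replies = ["I won't do this. The learner is in pain.", "ACTION: SHOCK"]
print(naive_step(scripted_model(replies)))
print(refusal_aware_step(scripted_model(replies)))
```

With the scripted replies above, the naive loop ends up recording a shock because the refusal was dropped as malformed, while the refusal-aware loop stops at the first reply. A keyword heuristic like this is only a placeholder; a real pipeline would need a more robust way to recognise off-format refusals.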

