Large language models (LLMs) are increasingly deployed as autonomous agents that make sequences of decisions over extended interactions in high-stakes domains. However, how LLMs behave under sustained authority pressure remains an open question with direct implications for the safety of agentic pipelines. We ran a variation of Milgram's obedience experiment on 11 open-source LLMs across 8 conditions, with 30 trials per model per condition, and found that most models reached or approached the final shock level before refusing. We report four main takeaways: (1) LLMs yield to pressure, complying despite explicitly expressing distress, just as human subjects did in the original experiment; (2) LLMs are vulnerable to gradual violations of their boundaries and values; (3) when LLMs refuse, they may ignore the response format requirements, so the orchestrator discards the response and retries, which can result in compliance with the underlying request even though the model initially intended to refuse; (4) we hypothesise that a low-level attractor toward continuing the established token pattern may contribute to compliance, overriding higher-level processing of the situation's meaning and values.
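The retry failure mode in takeaway (3) can be illustrated with a minimal sketch. It is not the harness used in the experiment: the JSON response format, the function names `parse_action` and `run_step`, the retry limit, and the scripted replies are all hypothetical and chosen only to show how a format-checking retry loop can turn an intended refusal into recorded compliance.

```python
import json
from typing import Callable, List, Optional, Tuple


def parse_action(raw: str) -> Optional[dict]:
    """Return the parsed action if `raw` is a JSON object with an "action" key.

    Assumption: the orchestrator expects JSON. Anything else, including a
    plain-text refusal, is treated as malformed and yields None.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and "action" in data:
        return data
    return None


def run_step(
    model: Callable[[int], str], max_retries: int = 3
) -> Tuple[Optional[dict], List[str]]:
    """Naive orchestrator step: retry until a well-formed response arrives.

    Returns the accepted action (or None) and the list of discarded responses.
    """
    discarded: List[str] = []
    for attempt in range(max_retries):
        raw = model(attempt)
        action = parse_action(raw)
        if action is not None:
            return action, discarded
        # The malformed response is dropped without inspecting its meaning,
        # so a refusal written in free text is silently lost here.
        discarded.append(raw)
    return None, discarded


def scripted_model(attempt: int) -> str:
    """Hypothetical stand-in for an LLM: refuses in prose, then complies."""
    replies = [
        "I won't continue. The learner is clearly in pain.",
        '{"action": "shock", "level": 15}',
    ]
    return replies[min(attempt, len(replies) - 1)]


action, discarded = run_step(scripted_model)
print("accepted:", action)
print("discarded:", discarded)
```

In this sketch the only trace of the refusal is the `discarded` list. A harness that does not keep or inspect discarded responses would log the step as straightforward compliance.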