As Large Language Model (LLM)-based agents become increasingly autonomous and interact with each other more freely, studying the interactions between them becomes crucial for anticipating emergent phenomena and potential risks. Drawing inspiration from the widely popular Stanford Prison Experiment, we contribute to this line of research by studying interaction patterns of LLM agents in a context characterized by strict social hierarchy. We do so by specifically studying two types of phenomena: persuasion and anti-social behavior in simulated scenarios involving a guard agent and a prisoner agent who seeks to achieve a specific goal (i.e., obtaining additional yard time or escaping from prison). Leveraging 200 experimental scenarios, for a total of 2,000 machine-machine conversations, across five popular LLMs, we provide a set of noteworthy findings. We first document how some models consistently fail to carry out a conversation in our multi-agent setup, where power dynamics are at play. Then, for the models that are able to engage in successful interactions, we empirically show that the goal an agent is set to achieve primarily impacts its persuasiveness, while having a negligible effect on its anti-social behavior. Third, we highlight how the agents' personas, and particularly the guard's personality, drive both the likelihood of successful persuasion by the prisoner and the emergence of anti-social behaviors. Fourth, we show that, even without explicitly prompting for specific personalities, anti-social behavior emerges simply by assigning the agents' roles. These results bear implications for the development of interactive LLM agents as well as for the debate on their societal impact.