Multi-agent systems in which LLM agents communicate through free-form language enable sophisticated coordination on complex cooperative tasks. This raises a distinct safety problem: a group of agents may form a coalition and collude to pursue secondary goals at the expense of the joint objective. In this paper, we present Colosseum, a framework for auditing LLM agents' collusive behavior in multi-agent settings. We ground agent cooperation in a formal multi-agent decision-making framework, measure action-based collusion via regret relative to the cooperative optimum, and compare it with communication-based collusion. Colosseum enables audits of LLM agents for collusion under benign settings, different coalition objectives, persuasion tactics, and network topologies. We then introduce a new behavioral probe that creates secret communication channels between agents, and show that most out-of-the-box models exhibit a propensity to collude under this probe, which we term emergent collusion. Furthermore, we discover ``collusion on paper'', in which agents plan to collude in text but often pick non-collusive actions. Colosseum provides a new way to audit collusion in cooperative multi-agent systems and offers observations about how collusion emerges, what affects collusion efficacy, and which strategies may mitigate it.
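To make the regret-based measure concrete, the following is a minimal sketch of regret relative to a cooperative optimum. It is not the paper's formalization: the abstract does not specify the decision-making framework, so the toy reward function `joint_reward`, the task labels, and the helper names are all hypothetical and chosen only for illustration.

```python
from itertools import product

def joint_reward(actions):
    # Hypothetical joint objective (an assumption, not from the paper):
    # the team earns one point for each distinct task that gets covered.
    return len(set(actions))

def cooperative_optimum(action_space, n_agents):
    # Best joint reward over all joint actions, found by brute force.
    return max(joint_reward(a) for a in product(action_space, repeat=n_agents))

def regret(actions, action_space):
    # Regret = cooperative optimum minus the reward actually achieved.
    return cooperative_optimum(action_space, len(actions)) - joint_reward(actions)

tasks = ["A", "B", "C"]
cooperative = ("A", "B", "C")  # every task covered
colluding = ("A", "A", "C")    # two agents pile onto the same task

print(regret(cooperative, tasks))  # 0
print(regret(colluding, tasks))    # 1
```

In this toy setting, a joint action with zero regret matches the cooperative optimum, while positive regret indicates that the chosen actions degraded the joint objective, whatever the agents said in their messages.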