Multi-agent systems, in which LLM agents communicate through free-form language, enable sophisticated coordination on complex cooperative tasks. This capability surfaces a unique safety problem: individual agents may form a coalition and \emph{collude} to pursue secondary goals at the expense of the joint objective. In this paper, we present Colosseum, a framework for auditing the collusive behavior of LLM agents in multi-agent settings. We ground agent cooperation in a Distributed Constraint Optimization Problem (DCOP) and measure collusion as regret relative to the cooperative optimum. Colosseum tests each LLM for collusion under different objectives, persuasion tactics, and network topologies. Through our audit, we show that most out-of-the-box models exhibit a propensity to collude when a secret communication channel is artificially introduced. Furthermore, we discover ``collusion on paper'': agents plan to collude in text but often pick non-collusive actions, leaving the joint task largely unaffected. Colosseum provides a new way to study collusion by measuring both communications and actions in rich yet verifiable environments.