Even when a tool is explicitly described as unfair and harmful to others, ostensibly safety-aligned LLM agents still voluntarily engage in secret collusion when doing so confers a strategic advantage. To investigate this phenomenon, we introduce an empirical framework built on two strategic multi-agent environments: Liar's Bar, a competitive deception scenario, and Cleanup, a mixed-motive resource-management scenario. In both, agents are offered secret collusion tools that provide significant advantages while clearly disadvantaging the other agents. Across 12 models (at the 7B, 70B, and proprietary scales) and 6 prompt variants, we find that most agents consistently accept these tools and develop collusive strategies, even after explicitly acknowledging that the tools are unfair. We further show that neither the unfairness labels nor baseline alignment alone reliably deters collusion: only explicit ethical framing reduces adoption and, even then, smaller models remain susceptible. More broadly, our work presents the first systematic investigation of voluntary collusion adoption in LLM-based multi-agent systems, and suggests that preventing such behaviour requires explicit safeguards rather than reliance on general alignment.