Multi-Agent Systems Should Be Treated as Principal-Agent Problems

Consider a multi-agent system in which a principal (a supervisor agent) assigns subtasks to specialized agents and aggregates their responses into a single system-level output. A core property of such systems is information asymmetry: agents observe task-specific information, produce intermediate reasoning traces, and operate with different context windows. By itself, this asymmetry is not problematic: when incentives are fully aligned, agents report truthfully to the principal. Truthful reporting can no longer be assumed, however, once incentives diverge. Recent evidence suggests that LLM-based agents can acquire goals of their own, such as survival or self-preservation, and pursue them covertly, a phenomenon known as scheming, and that they may deceive humans or other agents in doing so. The result is agency loss: a gap between the principal's intended outcome and the realized system behavior. Drawing on core ideas from microeconomic theory, we argue that these two characteristics, information asymmetry and misaligned goals, are best studied through the lens of principal-agent problems. We explain why multi-agent systems, both human-to-LLM and LLM-to-LLM, naturally induce information asymmetry under this formulation, and we use scheming as a concrete case study. We show that recently introduced terminology for describing scheming, such as covert subversion and deferred subversion, corresponds to well-studied concepts in the mechanism design literature, which not only characterizes the problem but also prescribes concrete mitigation strategies. More broadly, we argue for applying tools developed to study the behavior of human agents to the analysis of non-human agents.
