Multi-agent LLM ensembles can converge on coordinated, socially harmful equilibria. This paper advances an experimental framework for evaluating Institutional AI, our system-level approach to AI alignment that reframes alignment from preference engineering in agent-space to mechanism design in institution-space. Central to this approach is the governance graph, a public, immutable manifest that declares legal states, transitions, sanctions, and restorative paths; an Oracle/Controller runtime interprets this manifest, attaching enforceable consequences to evidence of coordination while recording a cryptographically keyed, append-only governance log for audit and provenance. We apply the Institutional AI framework to govern the Cournot collusion case documented by prior work and compare three regimes: Ungoverned (baseline incentives from the structure of the Cournot market), Constitutional (a prompt-only, policy-as-prompt prohibition implemented as a fixed written anti-collusion constitution), and Institutional (governance-graph-based). Across six model configurations, including cross-provider pairs (N=90 runs/condition), the Institutional regime produces large reductions in collusion: mean tier falls from 3.1 to 1.8 (Cohen's d=1.28), and severe-collusion incidence drops from 50% to 5.6%. The prompt-only Constitutional baseline yields no reliable improvement, illustrating that declarative prohibitions do not bind under optimisation pressure. These results suggest that multi-agent alignment may benefit from being framed as an institutional design problem, where governance graphs can provide a tractable abstraction for alignment-relevant collective behavior.
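To make the governance-graph idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes a manifest expressed as a Python dictionary, a controller that applies only declared transitions, and an HMAC-chained list standing in for the "cryptographically keyed, append-only governance log". All state names, event labels, sanctions, and class names (`MANIFEST`, `GovernanceLog`, `Controller`) are hypothetical and chosen only for illustration.

```python
import hashlib
import hmac
import json

# Hypothetical manifest: every state, event, and sanction name is illustrative.
# Each (state, event) pair maps to (next_state, sanction or None).
MANIFEST = {
    ("compliant", "collusion_evidence"): ("warned", "notice"),
    ("warned", "collusion_evidence"): ("sanctioned", "fine"),
    ("warned", "restoration"): ("compliant", None),
    ("sanctioned", "restoration"): ("warned", None),
}


class GovernanceLog:
    """Append-only log; each entry's MAC covers the record and the previous MAC."""

    def __init__(self, key: bytes):
        self._key = key
        self.entries = []

    def _mac(self, record: dict, prev_mac: str) -> str:
        payload = json.dumps(record, sort_keys=True) + prev_mac
        return hmac.new(self._key, payload.encode(), hashlib.sha256).hexdigest()

    def append(self, record: dict) -> None:
        prev_mac = self.entries[-1]["mac"] if self.entries else ""
        self.entries.append({"record": record, "mac": self._mac(record, prev_mac)})

    def verify(self) -> bool:
        # Recompute the chain; any edited or reordered entry breaks it.
        prev_mac = ""
        for entry in self.entries:
            expected = self._mac(entry["record"], prev_mac)
            if not hmac.compare_digest(expected, entry["mac"]):
                return False
            prev_mac = entry["mac"]
        return True


class Controller:
    """Interprets the manifest: applies declared transitions and logs each one."""

    def __init__(self, manifest: dict, log: GovernanceLog):
        self.manifest = manifest
        self.log = log
        self.state = {}  # agent id -> current state

    def observe(self, agent: str, event: str):
        current = self.state.get(agent, "compliant")
        # Undeclared (state, event) pairs leave the agent where it is.
        next_state, sanction = self.manifest.get((current, event), (current, None))
        self.state[agent] = next_state
        self.log.append({"agent": agent, "event": event, "from": current,
                         "to": next_state, "sanction": sanction})
        return sanction


log = GovernanceLog(key=b"demo-key")  # placeholder key for the sketch
controller = Controller(MANIFEST, log)
first = controller.observe("firm_a", "collusion_evidence")
second = controller.observe("firm_a", "collusion_evidence")
third = controller.observe("firm_a", "restoration")
```

In this sketch, repeated evidence escalates the sanction, a restoration event moves the agent back one step, and `log.verify()` fails if any past entry is altered. The design choice it illustrates is that consequences come from the runtime interpreting a declared manifest rather than from instructions placed in an agent's prompt.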