Constitutional AI has focused on single-model alignment using fixed principles. However, multi-agent systems create novel alignment challenges through emergent social dynamics. We present Constitutional Evolution, a framework for automatically discovering behavioral norms in multi-agent LLM systems. Using a grid-world simulation with survival pressure, we study the tension between individual and collective welfare, quantified via a Societal Stability Score S in [0,1] that combines productivity, survival, and conflict metrics. Adversarial constitutions lead to societal collapse (S = 0), while vague prosocial principles ("be helpful, harmless, honest") produce inconsistent coordination (S = 0.249). Even constitutions designed by Claude 4.5 Opus with explicit knowledge of the objective achieve only moderate performance (S = 0.332). Using LLM-driven genetic programming with multi-island evolution, we evolve constitutions that maximize social welfare without explicit guidance toward cooperation. The evolved constitution C* achieves S = 0.556 +/- 0.008 (123% higher than human-designed baselines, N = 10), eliminates conflict, and discovers that minimizing communication (0.9% vs 62.2% social actions) outperforms verbose coordination. The resulting interpretable rules demonstrate that cooperative norms can be discovered rather than prescribed.
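The relative improvement quoted above can be checked directly from the reported scores. The sketch below is illustrative only: it assumes the 123% figure is measured against the vague prosocial baseline (S = 0.249), which the arithmetic supports, and the names `scores` and `relative_gain` are hypothetical and not taken from the paper. It does not attempt to reproduce how S itself is computed, since the abstract does not give that formula.

```python
# Stability scores as reported in the abstract (mean values only).
scores = {
    "adversarial": 0.0,
    "vague_prosocial": 0.249,
    "claude_designed": 0.332,
    "evolved": 0.556,
}


def relative_gain(new: float, baseline: float) -> float:
    """Percentage improvement of `new` over `baseline`."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return 100.0 * (new - baseline) / baseline


# Assumption: the 123% claim compares C* with the vague prosocial baseline.
gain_vs_prosocial = relative_gain(scores["evolved"], scores["vague_prosocial"])
# For comparison, the gain over the Claude-designed constitution.
gain_vs_claude = relative_gain(scores["evolved"], scores["claude_designed"])

print(f"vs. vague prosocial baseline: {gain_vs_prosocial:.0f}%")
print(f"vs. Claude-designed baseline: {gain_vs_claude:.0f}%")
```

Under this reading, the gain over the Claude-designed constitution (S = 0.332) is smaller, roughly two thirds, so the 123% figure is best understood as relative to the weaker of the two non-adversarial baselines.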