Fairness in language models is typically studied as a property of a single, centrally optimized model. As large language models become increasingly agentic, we propose that fairness emerges through interaction and exchange between agents. We study this in a controlled hospital triage framework in which two agents negotiate over three structured debate rounds. One agent is aligned to a specific ethical framework via retrieval-augmented generation (RAG), while the other is either unaligned or adversarially prompted to favor demographic groups over clinical need. We find that alignment systematically shapes negotiation strategies and allocation patterns. Neither agent's allocation is ethically adequate in isolation, yet their joint final allocation can satisfy fairness criteria that neither would have reached alone. Aligned agents partially moderate bias through contestation rather than override, acting as corrective patches that restore access for marginalized groups without fully converting a biased counterpart. We further observe that even explicitly aligned agents exhibit intrinsic biases toward certain frameworks, consistent with known left-leaning tendencies in LLMs. We connect these limits to Arrow's Impossibility Theorem: no aggregation mechanism can simultaneously satisfy all desiderata of collective rationality, and multi-agent deliberation navigates rather than resolves this constraint. Our results reposition fairness as an emergent, procedural property of decentralized agent interaction, and identify the system, rather than the individual agent, as the appropriate unit of evaluation.