We develop a game-theoretic framework for predicting and steering the behavior of populations of large language models (LLMs) through Nash equilibrium (NE) analysis. To avoid the intractability of equilibrium computation in open-ended text spaces, we model each agent's action as a mixture over human subpopulations. Agents actively and strategically choose which groups to align with, yielding an interpretable and behaviorally substantive policy class. Under standard concave-utility assumptions, we derive closed-form NE characterizations that enable analytical system-level predictions and provide explicit, actionable guidance for shifting alignment targets toward socially desirable outcomes. The method functions as an active alignment layer on top of existing alignment pipelines such as RLHF. In a social-media setting, we show that a population of LLMs, especially reasoning-based models, may exhibit political exclusion, a pathology in which some subpopulations are ignored by every LLM agent; our method avoids this failure mode, illustrating its promise for regulating multi-agent LLM dynamics across domains.
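As a minimal formal sketch of the policy class described above (the symbols $n$, $K$, $\pi_i$, and $u_i$ are illustrative notation, not fixed by the abstract): each of $n$ LLM agents selects a mixture over $K$ human subpopulations,
\[
  \pi_i \in \Delta^{K} := \Big\{ \pi \in \mathbb{R}_{\ge 0}^{K} : \textstyle\sum_{k=1}^{K} \pi_k = 1 \Big\},
\]
and a strategy profile $\pi^{\ast} = (\pi_1^{\ast}, \dots, \pi_n^{\ast})$ is a Nash equilibrium when no agent can gain by deviating unilaterally:
\[
  u_i(\pi_i^{\ast}, \pi_{-i}^{\ast}) \;\ge\; u_i(\pi_i, \pi_{-i}^{\ast})
  \qquad \text{for all } \pi_i \in \Delta^{K} \text{ and all } i .
\]
Since each $\Delta^{K}$ is compact and convex, if every $u_i(\cdot, \pi_{-i})$ is continuous and concave in the agent's own mixture (the standard assumption invoked above), an equilibrium exists by Rosen's theorem for concave games; the associated KKT conditions, under which $\nabla_{\pi_i} u_i$ is constant across the support of $\pi_i^{\ast}$, are the kind of first-order structure a closed-form characterization would instantiate.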