Deploying large language model (LLM) agents in shared environments introduces a fundamental tension between individual alignment and collective stability: locally rational decisions can impose negative externalities that degrade system-level performance. We propose Socially-Weighted Alignment (SWA), a game-theoretic framework that modifies inference-time decision-making by interpolating between an agent's private objective and an estimate of group welfare via a social weight $\lambda \in [0,1]$. In a shared-resource congestion game with $n$ agents and congestion severity $\beta$, we show that SWA induces a critical threshold $\lambda^* = (n-\beta)/(n-1)$ above which agents no longer have a marginal incentive to increase demand under overload, yielding a phase transition from persistent congestion to stable operation near capacity. We further provide an inference-time algorithmic instantiation of SWA that requires neither parameter updates nor multi-agent reinforcement learning, and empirically validate the predicted threshold behavior in a multi-agent simulation.
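The threshold can be illustrated with a minimal sketch. The model below is an assumption, not the paper's exact game: each extra unit of demand under overload yields private benefit $1$ but imposes total congestion cost $\beta$, shared equally among the $n$ agents, and each agent maximizes the SWA objective $U_i = (1-\lambda)\,u_i + \lambda\,\bar{W}$, where $\bar{W}$ is mean utility. Under these stylized assumptions the marginal incentive to add demand changes sign exactly at $\lambda^* = (n-\beta)/(n-1)$:

```python
def swa_marginal_incentive(n: int, beta: float, lam: float) -> float:
    """Marginal SWA utility of one more unit of demand under overload.

    With U_i = (1 - lam) * u_i + lam * mean(u), the marginal incentive is
    (1 - lam) * (1 - beta/n) + lam * (1 - beta)/n = 1 - beta/n - lam*(n-1)/n.
    """
    private = 1 - beta / n    # own benefit minus own share of the congestion cost
    welfare = (1 - beta) / n  # per-agent change in group welfare
    return (1 - lam) * private + lam * welfare

def critical_lambda(n: int, beta: float) -> float:
    # Threshold stated in the abstract: lambda* = (n - beta) / (n - 1)
    return (n - beta) / (n - 1)

n, beta = 10, 3.0
lam_star = critical_lambda(n, beta)          # 7/9 ~ 0.778 for these values
print(swa_marginal_incentive(n, beta, lam_star - 0.05) > 0)  # below threshold: still incentivized
print(swa_marginal_incentive(n, beta, lam_star + 0.05) < 0)  # above threshold: incentive vanishes
```

Setting the marginal incentive to zero and solving for $\lambda$ recovers the stated threshold, which is where the predicted phase transition from congestion to stable operation occurs.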