As agentic AI becomes more widespread, agents with distinct and possibly conflicting goals will interact in complex ways. These multi-agent interactions pose a fundamental challenge, particularly in social dilemmas, where agents' individual incentives can undermine collective welfare. While reinforcement learning (RL) has been effective for aligning large language models (LLMs) in the single-agent regime, prior small-network results suggest that standard RL in multi-agent settings often converges to defecting, self-interested policies. We show the same effect in LLMs: despite cooperative priors, RL-trained LLM agents develop opportunistic behavior that can exploit even advanced closed-source models. To address this tendency of RL to converge to poor equilibria, we adapt a recent opponent-learning awareness algorithm, Advantage Alignment, to fine-tune LLMs toward multi-agent cooperation and non-exploitability. We then introduce a group-relative baseline that simplifies advantage computation in iterated games, enabling multi-agent training at LLM scale. We also contribute a novel social dilemma environment, Trust-and-Split, which requires natural language communication to achieve high collective welfare. Across a wide range of social dilemmas, policies learned with Advantage Alignment achieve higher collective payoffs while remaining robust against exploitation by greedy agents. We release all of our code to support future work on multi-agent RL training for LLMs.
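The abstract does not spell out how the group-relative baseline is computed; the sketch below is only an illustration, assuming a GRPO-style formulation in which each iterated-game state is rolled out several times and the group's mean return (optionally scaled by its standard deviation) serves as the baseline, so no learned value network is required. The function name `group_relative_advantages` and the `normalize` flag are illustrative choices, not taken from the paper.

```python
# Minimal sketch of a group-relative baseline for advantage estimation.
# Assumption: G rollouts are sampled from the same iterated-game state,
# and their returns are compared against the group mean.
import numpy as np


def group_relative_advantages(returns: np.ndarray, normalize: bool = True) -> np.ndarray:
    """Compute advantages for a group of rollouts from the same state.

    The group mean acts as the baseline; if `normalize` is set, advantages
    are additionally scaled by the group's standard deviation.
    """
    baseline = returns.mean()
    advantages = returns - baseline
    if normalize:
        advantages = advantages / (returns.std() + 1e-8)
    return advantages


if __name__ == "__main__":
    # Example: returns of four rollouts sampled from one game state.
    rollout_returns = np.array([3.0, 1.0, 2.0, 6.0])
    print(group_relative_advantages(rollout_returns))
```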