Multi-agent reinforcement learning (MARL) is increasingly used to design learning-enabled agents that interact in shared environments. However, training MARL algorithms in general-sum games remains challenging: learning dynamics can become unstable, and convergence guarantees typically hold only in restricted settings such as two-player zero-sum or fully cooperative games. Moreover, when agents have heterogeneous and potentially conflicting preferences, it is unclear what system-level objective should guide learning. In this paper, we propose a new MARL pipeline called Near-Potential Policy Optimization (NePPO) for computing approximate Nash equilibria in mixed cooperative--competitive environments. The core idea is to learn a player-independent potential function such that the Nash equilibrium of a cooperative game with this potential as the common utility approximates a Nash equilibrium of the original game. To this end, we introduce a novel MARL objective whose minimization yields the best potential-function candidate and, consequently, an approximate Nash equilibrium of the original game. We develop an algorithmic pipeline that minimizes this objective using zeroth-order gradient descent and returns an approximate Nash equilibrium policy. We empirically show that this approach outperforms popular baselines such as MAPPO, IPPO, and MADDPG.
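To make the optimization step concrete, the sketch below illustrates generic two-point zeroth-order gradient descent on a black-box objective. It is not the NePPO pipeline itself: the objective `potential_mismatch`, its toy quadratic form, and all hyperparameters are hypothetical stand-ins, since the paper's actual objective and parameterization are not specified in the abstract.

```python
# Illustrative sketch only: generic two-point (Gaussian-smoothing) zeroth-order
# gradient descent. The loss below is a hypothetical stand-in for the NePPO
# potential-mismatch objective, which the abstract does not define.
import numpy as np

rng = np.random.default_rng(0)

def potential_mismatch(theta: np.ndarray) -> float:
    """Hypothetical black-box loss measuring how poorly the candidate potential
    (parameterized by theta) matches the original game. Toy minimum at theta = 1."""
    return float(np.sum((theta - 1.0) ** 2))

def zeroth_order_grad(loss, theta: np.ndarray, mu: float = 1e-2, n_samples: int = 16) -> np.ndarray:
    """Two-point gradient estimate using only loss evaluations (no analytic gradients)."""
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        u = rng.standard_normal(theta.shape)
        grad += (loss(theta + mu * u) - loss(theta - mu * u)) / (2.0 * mu) * u
    return grad / n_samples

theta = rng.standard_normal(8)  # parameters of the candidate potential function
for step in range(200):
    theta -= 0.05 * zeroth_order_grad(potential_mismatch, theta)

print("final loss:", potential_mismatch(theta))
```

In this style of method, each loss evaluation would correspond to estimating the objective from rollouts of the current policies, so the gradient estimate requires only simulator access rather than differentiable dynamics.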