Finding Nash equilibria in two-player zero-sum imperfect-information games remains a central challenge in multi-agent reinforcement learning. Recent multi-round regularization methods offer a promising direction, yet existing approaches either require full enumeration of the game tree or rely on non-policy-gradient inner solvers that underperform in practice, so a scalable policy-gradient-based solution remains an open problem. In this paper, we propose a novel multi-round regularization procedure and show that it guarantees a strictly monotonic reduction in the Bregman divergence to Nash equilibria, and eventual convergence to an equilibrium, in two-player zero-sum extensive-form games. Guided by this framework, we develop a practical algorithm, Nash Policy Gradient (NashPG), which places the regularization directly in the policy optimization objective and is implemented using standard policy gradient methods. Empirically, NashPG achieves comparable or lower exploitability than prior model-free methods on classic benchmark games and scales to large domains such as Battleship and No-Limit Texas Hold'em, where it attains higher average payoff in head-to-head play.
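To make the idea of multi-round regularization concrete, the following is a minimal sketch on a 2x2 zero-sum matrix game. It is not the NashPG algorithm itself: it assumes the KL divergence as the Bregman divergence, uses exact gradients through a softmax policy rather than sampled policy-gradient estimates, and works on a normal-form rather than an extensive-form game. The payoff matrix, the hyperparameters (`alpha`, `lr`, `rounds`, `steps`), and all function names are hypothetical choices made for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def payoffs_vs(A, x, y):
    """Per-action expected payoffs: A y for the row player, A^T x for the column player."""
    row = [sum(a * yj for a, yj in zip(r, y)) for r in A]
    col = [sum(A[i][j] * x[i] for i in range(len(A))) for j in range(len(A[0]))]
    return row, col

def exploitability(A, x, y):
    """Sum of both players' best-response gains; zero exactly at a Nash equilibrium."""
    row, col = payoffs_vs(A, x, y)
    return max(row) - min(col)

def softmax_grad(p, g):
    """Map a gradient g w.r.t. probabilities p to a gradient w.r.t. softmax logits."""
    baseline = sum(pi * gi for pi, gi in zip(p, g))
    return [pi * (gi - baseline) for pi, gi in zip(p, g)]

def solve_regularized(A, theta, phi, x_ref, y_ref, alpha, lr, steps):
    """Inner loop: simultaneous gradient ascent/descent on the KL-regularized objective
    x^T A y - alpha * KL(x || x_ref) + alpha * KL(y || y_ref)."""
    for _ in range(steps):
        x, y = softmax(theta), softmax(phi)
        row, col = payoffs_vs(A, x, y)
        # Payoff gradient plus the pull toward the reference policy
        # (additive constants are dropped; softmax_grad removes them anyway).
        gx = [r - alpha * math.log(xi / ri) for r, xi, ri in zip(row, x, x_ref)]
        gy = [c + alpha * math.log(yj / rj) for c, yj, rj in zip(col, y, y_ref)]
        theta = [t + lr * d for t, d in zip(theta, softmax_grad(x, gx))]  # maximizer
        phi = [p - lr * d for p, d in zip(phi, softmax_grad(y, gy))]      # minimizer
    return theta, phi

def multi_round(A, rounds=20, steps=300, alpha=0.5, lr=0.5):
    """Outer loop: after each round, the solution becomes the next reference policy."""
    theta, phi = [0.0] * len(A), [0.0] * len(A[0])
    history = [exploitability(A, softmax(theta), softmax(phi))]
    for _ in range(rounds):
        x_ref, y_ref = softmax(theta), softmax(phi)
        theta, phi = solve_regularized(A, theta, phi, x_ref, y_ref, alpha, lr, steps)
        history.append(exploitability(A, softmax(theta), softmax(phi)))
    return softmax(theta), softmax(phi), history

# Hypothetical asymmetric matching-pennies payoffs for the row (maximizing) player.
A = [[2.0, -1.0], [-1.0, 1.0]]
x, y, history = multi_round(A)
print("row policy:", x)
print("column policy:", y)
print("exploitability, first and last round:", history[0], history[-1])
```

For this payoff matrix the unique equilibrium has both players choosing their first action with probability 0.4, so the printed policies can be checked by hand. The regularization term keeps each inner problem well conditioned, and resetting the reference policy between rounds is what lets the iterates approach an equilibrium of the original, unregularized game.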