A popular approach for solving zero-sum games is to maintain populations of policies to approximate the Nash Equilibrium (NE). Previous studies have shown that the Policy Space Response Oracle (PSRO) algorithm is an effective multi-agent reinforcement learning framework for solving such games. However, repeatedly training new policies from scratch to approximate a Best Response (BR) to the opponents' mixed policy at each iteration is both inefficient and costly. Some PSRO variants initialize a new policy by inheriting from past BR policies, but this limits the exploration of new policies, especially against challenging opponents. To address this issue, we propose Fusion-PSRO, which employs policy fusion to initialize policies that better approximate the BR. By selecting high-quality base policies from the meta-NE, policy fusion merges the base policies into a new policy through model averaging. The initialized policy thus incorporates multiple expert policies, making it easier to handle difficult opponents than inheriting from past BR policies or initializing from scratch. Moreover, our method modifies only the policy initialization phase, so it can be applied to nearly all PSRO variants without additional training overhead. Our experiments on non-transitive matrix games, Leduc Poker, and the more complex Liar's Dice demonstrate that Fusion-PSRO improves the performance of nearly all PSRO variants, achieving lower exploitability.
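To make the fusion step concrete, below is a minimal sketch of model averaging over policy networks, assuming PyTorch-style state dicts. The function name `fuse_policies` and the arguments `base_state_dicts` and `meta_ne_weights` are hypothetical illustrations; the abstract does not specify an implementation, only that base policies selected from the meta-NE are fused by averaging model parameters.

```python
import torch

def fuse_policies(base_state_dicts, meta_ne_weights):
    """Hypothetical sketch: fuse base policies into one initialization.

    base_state_dicts: list of state_dicts from high-quality base policies
        (e.g., those with the largest probabilities under the meta-NE).
    meta_ne_weights: the corresponding meta-NE probabilities, which are
        renormalized over the selected subset of base policies.
    Returns a single state_dict used to initialize the new BR policy.
    """
    total = sum(meta_ne_weights)
    weights = [w / total for w in meta_ne_weights]  # renormalize to sum to 1
    fused = {}
    for key in base_state_dicts[0]:
        # Weighted average of each parameter tensor across base policies.
        fused[key] = sum(w * sd[key].float()
                         for w, sd in zip(weights, base_state_dicts))
    return fused

# Usage sketch: initialize the new policy from the fused parameters,
# then train it as the BR oracle as in standard PSRO.
# new_policy.load_state_dict(fuse_policies(selected_dicts, selected_probs))
```

Because only the initialization changes, the subsequent BR training loop of whichever PSRO variant is in use remains untouched, which is why no extra training overhead is incurred.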