The ex ante equilibrium for two-team zero-sum games, in which agents within each team collaborate to compete against the opposing team, is known to be the best a team can achieve through coordination. Many existing works aim to extend ex ante equilibrium solving to large-scale team games based on Policy Space Response Oracles (PSRO). However, the joint team policy space constructed by the most prominent method, Team PSRO, cannot cover the entire team policy space in heterogeneous team games, where teammates play distinct roles. This insufficient policy expressiveness causes Team PSRO to become trapped in a sub-optimal ex ante equilibrium with significantly higher exploitability, so it never converges to the global ex ante equilibrium. To find the global ex ante equilibrium without introducing additional computational complexity, we first parameterize heterogeneous policies for teammates and prove that optimizing the heterogeneous teammates' policies sequentially guarantees a monotonic improvement in team rewards. We further propose Heterogeneous-PSRO (H-PSRO), the first PSRO framework for heterogeneous team games, which integrates the sequential correlation mechanism into PSRO. We prove that H-PSRO achieves lower exploitability than Team PSRO in heterogeneous team games. Empirically, H-PSRO converges in heterogeneous matrix games that non-heterogeneous baselines cannot solve. Further experiments show that H-PSRO outperforms non-heterogeneous baselines in both heterogeneous team games and homogeneous settings.
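The monotonic-improvement idea behind sequential optimization can be illustrated with a toy sketch. This is not the H-PSRO algorithm or its parameterized policies. It assumes a fixed opponent, two teammates with deterministic actions, and a hypothetical team payoff matrix `payoff`; the function name `sequential_improvement` is likewise made up for illustration. Each teammate in turn best-responds while the other is held fixed, so the team reward can never decrease from one update to the next.

```python
import numpy as np


def sequential_improvement(payoff, a1=0, a2=0, max_rounds=10):
    """Alternate single-teammate best responses against a fixed opponent.

    payoff[i, j] is the (hypothetical) team reward when teammate 1 plays
    action i and teammate 2 plays action j.
    Returns the final actions and the team reward after every update.
    """
    rewards = [float(payoff[a1, a2])]
    for _ in range(max_rounds):
        # Teammate 1 updates first, holding teammate 2's action fixed.
        a1 = int(np.argmax(payoff[:, a2]))
        rewards.append(float(payoff[a1, a2]))
        # Teammate 2 then updates against teammate 1's new action.
        a2 = int(np.argmax(payoff[a1, :]))
        rewards.append(float(payoff[a1, a2]))
        # Stop once a full round of updates yields no improvement.
        if rewards[-1] == rewards[-3]:
            break
    return a1, a2, rewards


# Hypothetical 3x3 team payoff matrix, chosen only for illustration.
payoff = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
    [2.0, 0.0, 5.0],
])

a1, a2, rewards = sequential_improvement(payoff)
print(a1, a2, rewards)
```

In this toy setting the reward sequence is non-decreasing by construction, but coordinate-wise updates can still stop at a local optimum (starting from `a1=1, a2=1` stays at reward 3), so the sketch shows only the monotonicity property, not convergence to a global equilibrium.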