Policy-Space Response Oracles (PSRO) is an influential algorithmic framework for approximating a Nash Equilibrium (NE) in multi-agent non-transitive games. Many previous studies have tried to promote policy diversity in PSRO. A major weakness of existing diversity metrics is that a population that is more diverse according to those metrics does not necessarily yield a better approximation to an NE, as we prove in the paper. To alleviate this problem, we propose a new diversity metric whose improvement guarantees a better approximation to an NE. We also develop a practical and well-justified method for optimizing this metric using only state-action samples. By incorporating our diversity regularization into the best-response solving step of PSRO, we obtain a new PSRO variant, Policy Space Diversity PSRO (PSD-PSRO). We present the convergence property of PSD-PSRO. Empirically, extensive experiments on various games demonstrate that PSD-PSRO is more effective at producing significantly less exploitable policies than state-of-the-art PSRO variants.
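To make "less exploitable" concrete, the following is a minimal sketch of how exploitability can be computed in a two-player zero-sum matrix game. It is an illustration only, not the paper's implementation: the function name `exploitability`, the rock-paper-scissors payoff matrix, and the convention of averaging the two players' best-response gains are all assumptions made here (some authors report the sum instead).

```python
import numpy as np


def exploitability(payoff, row_strategy, col_strategy):
    """Average best-response gain against a strategy profile.

    Hypothetical helper for illustration. `payoff[i, j]` is the row
    player's payoff in a zero-sum game; the column player receives the
    negative. A value of 0 means the profile is a Nash equilibrium.
    """
    payoff = np.asarray(payoff, dtype=float)
    x = np.asarray(row_strategy, dtype=float)
    y = np.asarray(col_strategy, dtype=float)

    # Best payoff the row player can get against the column strategy.
    row_best = np.max(payoff @ y)
    # Lowest row payoff the column player can force against the row strategy.
    col_best = np.min(x @ payoff)

    # At the profile itself the value is x^T A y for both terms, so the
    # gap row_best - col_best is the sum of both players' deviation gains.
    return (row_best - col_best) / 2.0


# Rock-paper-scissors, a standard non-transitive game (assumed example).
RPS = np.array([
    [0, -1, 1],   # rock
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
])

uniform = np.ones(3) / 3
always_rock = np.array([1.0, 0.0, 0.0])

nash_gap = exploitability(RPS, uniform, uniform)
rock_gap = exploitability(RPS, always_rock, always_rock)
print(nash_gap, rock_gap)
```

Under these assumptions the uniform profile has zero exploitability because it is the equilibrium of rock-paper-scissors, while the always-rock profile is strictly exploitable because each player gains by switching to paper.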