Policy-Space Response Oracles (PSRO) is an influential algorithmic framework for approximating a Nash Equilibrium (NE) in multi-agent non-transitive games. Many previous studies have sought to promote policy diversity in PSRO. A major weakness of existing diversity metrics is that a population that is more diverse under these metrics does not necessarily yield a better approximation to an NE, as we prove in the paper. To alleviate this problem, we propose a new diversity metric, the improvement of which guarantees a better approximation to an NE. We also develop a practical and well-justified method to optimize our diversity metric using only state-action samples. By incorporating our diversity regularization into the best-response solving step of PSRO, we obtain a new PSRO variant, Policy Space Diversity PSRO (PSD-PSRO). We present the convergence property of PSD-PSRO. Empirically, extensive experiments on various games show that PSD-PSRO produces significantly less exploitable policies than state-of-the-art PSRO variants.
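To make the notion of a "less exploitable" policy concrete, the following minimal sketch computes exploitability for a mixed strategy in rock-paper-scissors, a standard non-transitive game. It is not PSD-PSRO and does not implement the proposed diversity metric. The function names and the simplified definition are assumptions made for illustration: in a symmetric zero-sum game with value 0, exploitability is taken to be the best-response payoff against the strategy.

```python
# Illustrative sketch only: exploitability in a symmetric zero-sum matrix game.
# This is NOT PSD-PSRO; names and the simplified definition are assumptions.

# Row player's payoff in rock-paper-scissors (rows/cols: rock, paper, scissors).
RPS = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def best_response_value(payoff, opponent_strategy):
    """Highest expected payoff a pure row action achieves against a fixed
    mixed strategy of the column player."""
    return max(
        sum(a * q for a, q in zip(row, opponent_strategy))
        for row in payoff
    )


def exploitability(payoff, strategy):
    """Assumed definition for a symmetric zero-sum game with value 0:
    how much a best-responding opponent gains against `strategy`.
    A Nash equilibrium strategy has exploitability 0."""
    return best_response_value(payoff, strategy)


uniform = [1 / 3, 1 / 3, 1 / 3]   # the NE of rock-paper-scissors
always_rock = [1.0, 0.0, 0.0]     # fully exploitable by always playing paper
rock_or_paper = [0.5, 0.5, 0.0]   # partially exploitable

for name, s in [("uniform", uniform),
                ("always_rock", always_rock),
                ("rock_or_paper", rock_or_paper)]:
    print(name, exploitability(RPS, s))
```

Under this definition, a strategy closer to the NE has lower exploitability, which is the quantity the experiments use to compare PSRO variants.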