Policy-Space Response Oracles (PSRO), which iteratively constructs a population of policies, has proven to be an effective method for approximating Nash Equilibria (NE) in zero-sum games. Existing studies attempt to improve diversity in the policy space, primarily by incorporating diversity regularization into the Best Response (BR) objective. However, such regularization causes the BR to deviate from reward maximization, often yielding a population that favors diversity over performance even when diversity is unnecessary. Consequently, exploitability remains difficult to reduce until the policy space has been thoroughly explored, especially in complex games. In this paper, we propose Conflux-PSRO, which fully exploits the diversity of the population by adaptively selecting and training policies at the state level. Specifically, Conflux-PSRO identifies useful policies from the existing population and employs a routing policy to select the most appropriate policy at each decision point, while simultaneously training the selected policies to enhance their effectiveness. Compared to the single-policy BR of traditional PSRO and its diversity-enhanced variants, the BR generated by Conflux-PSRO not only leverages the specialized expertise of diverse policies but also synergistically improves overall performance. Our experiments across various environments demonstrate that Conflux-PSRO significantly improves the utility of BRs and reduces exploitability compared to existing methods.
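To make the state-level routing idea concrete, the following is a minimal sketch in Python. The class name `RoutedBestResponse`, the `population` and `router` callables, and the toy usage are hypothetical illustrations under our own assumptions, not the paper's actual implementation; the joint training of the router and the selected policies described above is omitted.

```python
import numpy as np

class RoutedBestResponse:
    """Composes a best response by routing each state to one population policy.

    `population` is a list of policies, each mapping an observation to an
    action; `router` maps an observation to a score vector over population
    indices. Both are assumed to be trained jointly, as the abstract
    describes, but the training loop is not shown here.
    """

    def __init__(self, population, router):
        self.population = population
        self.router = router

    def act(self, obs):
        # Route: pick the population policy scored most appropriate for this
        # decision point, then act with that policy.
        idx = int(np.argmax(self.router(obs)))
        return self.population[idx](obs)

# Toy usage with two fixed policies and a uniform router (hypothetical).
population = [lambda obs: 0, lambda obs: 1]
router = lambda obs: np.array([0.5, 0.5])
br = RoutedBestResponse(population, router)
print(br.act(np.zeros(4)))  # acts with the first policy under the tied router
```

The design choice illustrated here is that the composed BR is itself a single policy from the opponent's perspective, so it can be added to the PSRO population like any other BR while internally reusing the specialized expertise of existing population members.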