Computing Nash equilibria is the key challenge in normal-form games with large strategy spaces, and open-ended learning frameworks provide an efficient approach. Previous studies invariably use diversity as the means of driving strategy improvement. However, diversity-based algorithms work only in zero-sum games with cyclic dimensions, which limits their applicability. Here, we propose SC-PSRO (Self-Confirming Policy Space Response Oracle), a unified open-ended learning framework for both zero-sum and general-sum games. In particular, we introduce the advantage function as an improved evaluation metric for strategies, which gives agents a unified learning objective in normal-form games. Concretely, SC-PSRO comprises three components: 1) a Diversity Module, which prevents strategies from being constrained by the cyclic structure; 2) a LookAhead Module, which improves strategies along the transitive dimension and is theoretically guaranteed to learn strategies in the direction of the Nash equilibrium; and 3) a Confirming-based Population Clipping Module, which tackles the equilibrium selection problem in general-sum games. This last module can be applied to learn equilibria with optimal rewards, which to our knowledge is the first improvement for general-sum games. Our experiments indicate that SC-PSRO achieves a considerable decrease in exploitability in zero-sum games and an increase in rewards in general-sum games, markedly surpassing previous methods. Code will be released upon acceptance.
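For readers unfamiliar with the exploitability metric reported for zero-sum games, the following is a minimal sketch of how it is commonly computed in a two-player zero-sum normal-form game. It is not part of SC-PSRO and is not taken from the paper's code. The function name `exploitability`, the rock-paper-scissors payoff matrix `RPS`, and the convention of summing (rather than averaging) the two players' best-response gains are all illustrative assumptions.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def exploitability(
    payoff: Matrix,
    row_strategy: Sequence[float],
    col_strategy: Sequence[float],
) -> float:
    """Sum of both players' best-response gains in a zero-sum matrix game.

    Assumption: payoff[i][j] is the row player's payoff and the column
    player receives -payoff[i][j]. Some papers divide this sum by two.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    # Expected payoff of each pure row action against the column mix.
    row_values = [
        sum(payoff[i][j] * col_strategy[j] for j in range(n_cols))
        for i in range(n_rows)
    ]
    # Expected row-player payoff when the column player plays each pure column.
    col_values = [
        sum(row_strategy[i] * payoff[i][j] for i in range(n_rows))
        for j in range(n_cols)
    ]
    # The row best response maximizes the row payoff; the column best
    # response minimizes it. The gap is zero exactly at a Nash equilibrium.
    return max(row_values) - min(col_values)


# Illustrative game: rock-paper-scissors, a purely cyclic zero-sum game.
RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
uniform = [1 / 3, 1 / 3, 1 / 3]
always_rock = [1.0, 0.0, 0.0]

print(exploitability(RPS, uniform, uniform))
print(exploitability(RPS, always_rock, always_rock))
```

Under these assumptions, the uniform profile has zero exploitability because it is the Nash equilibrium of rock-paper-scissors, while the profile in which both players always play rock has exploitability 2, since each player can gain 1 by deviating to a best response.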