Regret minimization methods are a powerful tool for learning an approximate Nash equilibrium (NE) in two-player zero-sum imperfect-information extensive-form games (IIEGs). We consider the problem in the interactive bandit-feedback setting, where the dynamics of the IIEG are unknown. In general, only the interactive trajectory and the value $v(z^t)$ of the reached terminal node are revealed. To learn an NE, the regret minimizer must estimate the full-feedback loss gradient $\ell^t$ from $v(z^t)$ and minimize the regret. In this paper, we propose a generalized framework for this learning setting, which supports the design and modular analysis of bandit regret minimization methods. We demonstrate that the most recent bandit regret minimization methods can be analyzed as special cases of our framework. Following this framework, we describe a novel method, SIX-OMD, for learning an approximate NE. It is model-free and substantially improves the best existing convergence rate from the order of $O(\sqrt{X B/T}+\sqrt{Y C/T})$ to $O(\sqrt{ M_{\mathcal{X}}/T} +\sqrt{ M_{\mathcal{Y}}/T})$. Moreover, SIX-OMD is computationally efficient, as it needs to update the current strategy and the average strategy only along the sampled trajectory.
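The idea of estimating a full loss vector from a single observed value can be illustrated in a much simpler setting. The sketch below is not SIX-OMD. It assumes a one-shot matrix game (a degenerate IIEG with a single decision point per player), losses in $[0, 1]$, a standard importance-weighted estimator, and a mirror descent step with the negative-entropy regularizer. All function names are hypothetical and chosen only for this illustration.

```python
import math
import random


def importance_weighted_loss(num_actions, action, observed_loss, policy):
    """Estimate the full loss vector from the single observed entry.

    Only the sampled action gets a nonzero entry, scaled by 1 / policy[action],
    which makes the estimate unbiased over the player's own sampling.
    """
    estimate = [0.0] * num_actions
    estimate[action] = observed_loss / policy[action]
    return estimate


def entropic_omd_step(policy, loss_estimate, learning_rate):
    """One online mirror descent step with the negative-entropy regularizer."""
    weights = [p * math.exp(-learning_rate * l)
               for p, l in zip(policy, loss_estimate)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_estimate(loss_matrix, x, y):
    """Exact expectation of the estimator over a ~ x and b ~ y (for checking)."""
    n = len(x)
    expected = [0.0] * n
    for a in range(n):
        for b in range(len(y)):
            est = importance_weighted_loss(n, a, loss_matrix[a][b], x)
            for i in range(n):
                expected[i] += x[a] * y[b] * est[i]
    return expected


def self_play(loss_matrix, rounds, learning_rate, seed=0):
    """Bandit-feedback self-play; returns the two average strategies.

    loss_matrix[a][b] is the row player's loss in [0, 1]; the column
    player's loss is assumed to be 1 - loss_matrix[a][b] (zero-sum up to a shift).
    """
    rng = random.Random(seed)
    n, m = len(loss_matrix), len(loss_matrix[0])
    x, y = [1.0 / n] * n, [1.0 / m] * m
    sum_x, sum_y = [0.0] * n, [0.0] * m
    for _ in range(rounds):
        sum_x = [s + p for s, p in zip(sum_x, x)]
        sum_y = [s + p for s, p in zip(sum_y, y)]
        a = rng.choices(range(n), weights=x)[0]
        b = rng.choices(range(m), weights=y)[0]
        v = loss_matrix[a][b]  # the only value revealed this round
        x_loss = importance_weighted_loss(n, a, v, x)
        y_loss = importance_weighted_loss(m, b, 1.0 - v, y)
        x = entropic_omd_step(x, x_loss, learning_rate)
        y = entropic_omd_step(y, y_loss, learning_rate)
    return [s / rounds for s in sum_x], [s / rounds for s in sum_y]


if __name__ == "__main__":
    matching_pennies = [[1.0, 0.0], [0.0, 1.0]]
    avg_x, avg_y = self_play(matching_pennies, rounds=5000, learning_rate=0.02)
    print("average strategies:", avg_x, avg_y)
```

In an extensive-form game the same estimator would be applied along the sampled trajectory, with reach probabilities playing the role of `policy[action]`. That extension is not shown here.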