Solving strategic games with huge action spaces is a critical yet under-explored topic in economics, operations research, and artificial intelligence. This paper proposes new learning algorithms for solving two-player zero-sum normal-form games in which the number of pure strategies is prohibitively large. Specifically, we combine no-regret analysis from online learning with Double Oracle (DO) methods from game theory. Our method, \emph{Online Double Oracle (ODO)}, provably converges to a Nash equilibrium (NE). Most importantly, unlike standard DO methods, ODO is \emph{rational} in the sense that each agent in ODO can exploit a strategic adversary with a regret bound of $\mathcal{O}(\sqrt{T k \log(k)})$, where $k$ is not the total number of pure strategies but the size of the \emph{effective strategy set}, which depends linearly on the support size of the NE. On tens of different real-world games, ODO outperforms DO, PSRO methods, and no-regret algorithms such as Multiplicative Weight Update by a significant margin, both in convergence rate to an NE and in average payoff against strategic adversaries.
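To make the combination of no-regret learning and Double Oracle concrete, the following is a minimal sketch in Python. It is an illustration only, not the paper's pseudocode. Each player runs Multiplicative Weight Update over a small effective strategy set and adds a best response whenever one outside the set beats every strategy inside it. The names `RestrictedMWU` and `self_play`, the step size `eta`, and the choice to reset weights when the set grows are all assumptions made for this sketch.

```python
import math


class RestrictedMWU:
    """MWU over a growing 'effective' subset of pure strategies (hypothetical helper)."""

    def __init__(self, n_pure, first_action=0, eta=0.1):
        self.n_pure = n_pure
        self.eta = eta                      # step size: an arbitrary choice for this sketch
        self.effective = [first_action]     # effective strategy set starts with one action
        self._reset_window()

    def _reset_window(self):
        # Assumption: restart weights and the loss window whenever the set grows.
        self.weights = {a: 1.0 for a in self.effective}
        self.window_loss = [0.0] * self.n_pure

    def strategy(self):
        """Mixed strategy over all pure strategies; zero outside the effective set."""
        total = sum(self.weights.values())
        probs = [0.0] * self.n_pure
        for a, w in self.weights.items():
            probs[a] = w / total
        return probs

    def update(self, loss_vector):
        # MWU step, restricted to the effective set.
        for a in self.effective:
            self.weights[a] *= math.exp(-self.eta * loss_vector[a])
        total = sum(self.weights.values())
        for a in self.effective:
            self.weights[a] /= total        # renormalise to keep weights well scaled
        # Accumulate losses of ALL pure strategies in the current window.
        for i, loss in enumerate(loss_vector):
            self.window_loss[i] += loss
        # Best response to the window's cumulative loss.
        br = min(range(self.n_pure), key=lambda i: self.window_loss[i])
        best_inside = min(self.window_loss[a] for a in self.effective)
        if br not in self.effective and self.window_loss[br] < best_inside - 1e-12:
            self.effective.append(br)
            self._reset_window()


def self_play(A, rounds=2000, eta=0.1):
    """A[i][j] is the row player's loss; the column player's loss is -A[i][j]."""
    n, m = len(A), len(A[0])
    row, col = RestrictedMWU(n, eta=eta), RestrictedMWU(m, eta=eta)
    row_avg, col_avg = [0.0] * n, [0.0] * m
    for _ in range(rounds):
        x, y = row.strategy(), col.strategy()
        row_loss = [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]
        col_loss = [-sum(A[i][j] * x[i] for i in range(n)) for j in range(m)]
        row.update(row_loss)
        col.update(col_loss)
        row_avg = [s + p / rounds for s, p in zip(row_avg, x)]
        col_avg = [s + p / rounds for s, p in zip(col_avg, y)]
    return row_avg, col_avg, row.effective, col.effective


# Rock-paper-scissors, written as the row player's loss matrix.
RPS = [[0, 1, -1], [-1, 0, 1], [1, -1, 0]]
row_avg, col_avg, row_eff, col_eff = self_play(RPS)
```

Here `row_avg` and `col_avg` are the time-averaged mixed strategies, and `row_eff` and `col_eff` list the pure strategies each player ended up using. In a game with many pure strategies but a small NE support, the point of this design is that the weights are only ever maintained over the effective set, which is what lets the regret bound depend on $k$ rather than on the total number of pure strategies.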