We study a two-player zero-sum game in which the row player aims to maximize their payoff against an adversarial column player, under an unknown payoff matrix estimated through bandit feedback. We propose three algorithms based on the Explore-Then-Commit (ETC) framework. The first adapts ETC to zero-sum games, the second incorporates adaptive elimination that leverages the $\varepsilon$-Nash Equilibrium property to efficiently select the optimal action pair, and the third extends the elimination algorithm with non-uniform exploration. Our objective is to demonstrate the applicability of ETC in the zero-sum game setting by focusing on learning pure strategy Nash Equilibria. A key contribution of our work is the derivation of instance-dependent upper bounds on the expected regret of the proposed algorithms, a type of bound that has received limited attention in the literature on zero-sum games. In particular, after $T$ rounds, we achieve instance-dependent regret upper bounds of $O(\Delta + \sqrt{T})$ for ETC in the zero-sum game setting and $O(\log(T \Delta^2)/\Delta)$ for the adaptive elimination algorithm and its variant with non-uniform exploration, where $\Delta$ denotes the suboptimality gap. Our results therefore indicate that ETC-based algorithms perform effectively in zero-sum game settings, achieving regret bounds comparable to existing methods while providing insight through instance-dependent analysis.
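To make the first algorithm concrete, the following is a minimal sketch of how Explore-Then-Commit might look in a zero-sum matrix game with a pure strategy Nash Equilibrium. It is an illustration under stated assumptions, not the authors' implementation: the function name `etc_zero_sum`, the parameter `pulls_per_pair`, the uniform exploration schedule, and the Gaussian noise model are all hypothetical choices made here. The sketch plays every action pair a fixed number of times, averages the observed payoffs, and then commits to the maximin row and minimax column of the estimated matrix.

```python
import numpy as np


def etc_zero_sum(payoff, horizon, pulls_per_pair, noise_std=1.0, seed=0):
    """Hypothetical ETC sketch for a zero-sum matrix game with bandit feedback.

    payoff         : true payoff matrix for the row player (unknown to the learner,
                     used here only to simulate noisy observations)
    horizon        : total number of rounds T
    pulls_per_pair : exploration samples per (row, column) pair (assumed uniform)
    """
    payoff = np.asarray(payoff, dtype=float)
    rng = np.random.default_rng(seed)
    n_rows, n_cols = payoff.shape

    explore_rounds = n_rows * n_cols * pulls_per_pair
    if explore_rounds > horizon:
        raise ValueError("exploration budget exceeds the horizon")

    # Exploration phase: sample each action pair and average the noisy payoffs.
    estimate = np.zeros_like(payoff)
    for i in range(n_rows):
        for j in range(n_cols):
            samples = payoff[i, j] + noise_std * rng.standard_normal(pulls_per_pair)
            estimate[i, j] = samples.mean()

    # Commit phase: the row player maximizes the worst-case (row minimum) payoff,
    # the column player minimizes the worst-case (column maximum) payoff.
    row = int(np.argmax(estimate.min(axis=1)))
    col = int(np.argmin(estimate.max(axis=0)))
    return row, col, estimate


if __name__ == "__main__":
    # Entry (1, 1) is a saddle point: smallest in its row, largest in its column.
    game = [[3.0, 1.0], [4.0, 2.0]]
    row, col, _ = etc_zero_sum(game, horizon=1000, pulls_per_pair=50)
    print(row, col)
```

When the estimated matrix has a saddle point, the committed pair coincides with the pure strategy Nash Equilibrium of the estimate. How large `pulls_per_pair` should be relative to $T$ and $\Delta$ is exactly what the regret analysis governs, and this sketch does not attempt to reproduce that choice.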