We study a two-player zero-sum game in which the row player aims to maximize their payoff against a competing column player, under an unknown payoff matrix estimated through bandit feedback. We propose three algorithms based on the Explore-Then-Commit (ETC) and action pair elimination frameworks. The first adapts ETC to zero-sum games, the second incorporates adaptive elimination that leverages the $\varepsilon$-Nash Equilibrium property to efficiently select the optimal action pair, and the third extends the elimination algorithm by employing non-uniform exploration. Our objective is to demonstrate the applicability of ETC and action pair elimination algorithms in a zero-sum game setting by focusing on learning pure strategy Nash Equilibria. A key contribution of our work is the derivation of instance-dependent upper bounds on the expected regret of our proposed algorithms, which has received limited attention in the literature on zero-sum games. In particular, after $T$ rounds, we achieve instance-dependent regret upper bounds of $O(\Delta + \sqrt{T})$ for ETC in the zero-sum game setting and $O\left(\frac{\log (T \Delta^2)}{\Delta}\right)$ for the adaptive elimination algorithm and its variant with non-uniform exploration, where $\Delta$ denotes the suboptimality gap. Our results therefore indicate that the ETC and action pair elimination algorithms perform effectively in zero-sum game settings, achieving regret bounds comparable to existing methods while providing insight through instance-dependent analysis.
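As a minimal sketch of the first of these algorithms, the following shows what Explore-Then-Commit looks like for a zero-sum matrix game with bandit feedback: sample every action pair a fixed number of times, form an empirical payoff matrix, and commit to a max-min (saddle-point) pair of the estimate. The function names and the per-pair exploration budget `m` are illustrative assumptions, not details fixed by the abstract, which also leaves the exact commit rule and the choice of `m` (used in the analysis to balance exploration and commitment regret) unspecified.

```python
import numpy as np

def etc_zero_sum(payoff_draw, n_rows, n_cols, m):
    """Illustrative ETC sketch for a zero-sum matrix game.

    payoff_draw(i, j): returns one noisy sample of the payoff to the
        row player when the pair (i, j) is played (bandit feedback).
    m: number of exploration samples per action pair (assumed budget).
    """
    # Exploration phase: sample each action pair m times.
    est = np.zeros((n_rows, n_cols))
    for i in range(n_rows):
        for j in range(n_cols):
            for _ in range(m):
                est[i, j] += payoff_draw(i, j)
    est /= m

    # Commit phase: pick the row maximizing the worst-case estimated
    # payoff, and the column minimizing against that row -- a
    # pure-strategy saddle point of the estimate, assumed to exist.
    i_star = int(np.argmax(est.min(axis=1)))
    j_star = int(np.argmin(est[i_star]))
    return i_star, j_star
```

With a deterministic payoff matrix containing a unique saddle point, e.g. `[[4, 1], [3, 2]]`, the sketch commits to the pure Nash pair `(1, 1)`; with noisy feedback, larger `m` trades exploration cost for a more reliable commitment, which is the balance the $O(\Delta + \sqrt{T})$ bound quantifies.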