Logarithmic Regret for Matrix Games against an Adversary with Noisy Bandit Feedback

This paper considers a variant of zero-sum matrix games in which, at each timestep, the row player chooses row $i$, the column player chooses column $j$, and the row player receives a noisy reward with mean $A_{i,j}$. The objective of the row player is to accumulate as much reward as possible, even against an adversarial column player. If the row player uses the EXP3 strategy, an algorithm known to obtain $\sqrt{T}$ regret against an arbitrary sequence of rewards, it follows immediately that the row player also achieves $\sqrt{T}$ regret relative to the Nash equilibrium in this game setting. However, partly motivated by the fact that EXP3 ignores the structure of the game, O'Donoghue et al. (2021) proposed a UCB-style algorithm that leverages the game structure and demonstrated that it greatly outperforms EXP3 empirically. While they showed that this UCB-style algorithm achieves $\sqrt{T}$ regret, in this paper we ask whether there exists an algorithm that provably achieves $\text{polylog}(T)$ regret against any adversary, analogous to results from stochastic bandits. We propose a novel algorithm that answers this question in the affirmative for the simple $2 \times 2$ setting, providing the first instance-dependent guarantees for games in the regret setting. Our algorithm overcomes two major hurdles: 1) obtaining logarithmic regret even though the Nash equilibrium is estimable only at a $1/\sqrt{T}$ rate, and 2) designing row-player strategies that guarantee that either the adversary provides information about the Nash equilibrium or the row player incurs negative regret. Moreover, in the full-information case we address the general $n \times m$ setting, where the first hurdle remains relevant. Finally, we show that neither EXP3 nor the UCB-based algorithm can achieve regret better than $\sqrt{T}$.
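To make the setting concrete, the following is a minimal Python sketch of the interaction protocol and of regret measured relative to the Nash value, for a $2 \times 2$ game with EXP3 as the row player. It is an illustration under stated assumptions, not the algorithm proposed in this paper: the function names `nash_2x2` and `exp3_regret`, the Bernoulli noise model (which requires the entries of $A$ to lie in $[0,1]$), the exploration parameter `gamma`, and the column player that best-responds to the row player's empirical play frequencies are all hypothetical choices made for the example.

```python
import numpy as np


def nash_2x2(A):
    """Return (x_star, value) for the maximizing row player of a 2x2 zero-sum game."""
    A = np.asarray(A, dtype=float)
    maximin = A.min(axis=1).max()  # best guarantee from a pure row
    minimax = A.max(axis=0).min()  # best guarantee from a pure column
    if maximin == minimax:
        # Saddle point: a pure row strategy is optimal.
        x = np.zeros(2)
        x[int(A.min(axis=1).argmax())] = 1.0
        return x, float(maximin)
    # No saddle point: the optimal row mix equalizes the payoff of both columns.
    (a, b), (c, d) = A
    denom = a - b - c + d
    p = (d - c) / denom
    return np.array([p, 1.0 - p]), float((a * d - b * c) / denom)


def exp3_regret(A, T, gamma=0.1, seed=0):
    """Simulate EXP3 as row player; return T * V - sum_t A[i_t, j_t].

    Assumptions (illustrative only): entries of A lie in [0, 1], rewards are
    Bernoulli with mean A[i, j], and the column player best-responds to the
    row player's empirical frequencies of past plays.
    """
    rng = np.random.default_rng(seed)
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    _, value = nash_2x2(A)
    log_w = np.zeros(n)       # log-weights, kept in log space for stability
    row_counts = np.ones(n)   # smoothed counts of the row player's past plays
    total_mean_reward = 0.0
    for _ in range(T):
        w = np.exp(log_w - log_w.max())
        probs = (1.0 - gamma) * w / w.sum() + gamma / n
        i = rng.choice(n, p=probs)
        # Column player minimizes the row player's expected reward
        # against the empirical row frequencies observed so far.
        j = int(np.argmin((row_counts / row_counts.sum()) @ A))
        reward = float(rng.random() < A[i, j])  # noisy bandit feedback
        # Importance-weighted update for the played row only.
        log_w[i] += gamma * (reward / probs[i]) / n
        row_counts[i] += 1
        total_mean_reward += A[i, j]
    return T * value - total_mean_reward


if __name__ == "__main__":
    A = [[0.8, 0.2], [0.3, 0.6]]
    x_star, v = nash_2x2(A)
    print("Nash row strategy:", x_star, "value:", v)
    print("EXP3 regret vs. Nash value:", exp3_regret(A, T=5000))
```

Here regret is computed from the mean rewards $A_{i_t,j_t}$ rather than the realized noisy rewards, so it can be negative whenever the column player deviates from its equilibrium strategy, which is the situation the second hurdle refers to.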
