Fairness in two-player zero-sum games with bandit feedback

We study two-player zero-sum games (TPZSGs) with bandit feedback under fairness constraints requiring every action to be played with probability at least $α/m$. Existing instance-dependent results target $\textit{pure}$ Nash equilibria, while fairness generically produces $\textit{mixed}$ equilibria, a harder learning target. Our key technical tool is a reparametrization: every fair strategy decomposes as $p = (α/m)\mathbf{1} + (1-α)\widetilde{p}$ with $\widetilde{p} \in Δ_m$, and substituting into the bilinear payoff yields $p^{\top}Aq = \widetilde{p}^{\top}\widetilde{A} q$ for a fair payoff matrix $\widetilde{A} := (1-α)A + α\mathbf{1} c^{\top}$, where $c_j = \tfrac{1}{m}\sum_i A(i,j)$ is the column-mean vector. The fair game on $A$ is then equivalent to a standard zero-sum game on $\widetilde{A}$, so equilibrium existence, KKT structure, and LP basis stability reduce to classical results applied to $\widetilde{A}$. We derive the fair minimax value, fair Nash equilibrium, fair regret, and a clean dual representation showing that the price of fairness is at most $α(1-1/m)$ and vanishes whenever the unconstrained equilibrium already has full support. Our main result is an $\widetilde{O}(T^{2/3})$ regret bound for an Explore-Then-Commit algorithm, $\texttt{Fair-ETC-TPZSG}$, applicable to general mixed fair equilibria, together with a discussion of why naive action elimination does not readily improve it. When the fair equilibrium has a single dominant action, equivalently when $\widetilde{p}^{\star}$ is a vertex of $Δ_m$, the bound sharpens to an instance-dependent $\widetilde{O}(1/\widetilde{Δ}(α)^{2})$, where $\widetilde{Δ}(α)$ is the LP-margin gap.
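
The reparametrization identity can be checked numerically. The sketch below is illustrative only and is not the paper's implementation: the helper names (`fair_payoff_matrix`, `to_fair_strategy`, `bilinear`, `random_simplex_point`), the matrix size, and the value $α = 0.3$ are all assumptions chosen for the example. It draws a random payoff matrix $A$, a random $\widetilde{p} \in Δ_m$, and a random column strategy $q$, then compares $p^{\top}Aq$ with $\widetilde{p}^{\top}\widetilde{A}q$.

```python
import random


def fair_payoff_matrix(A, alpha):
    """Return (1 - alpha) * A + alpha * 1 c^T, where c is the column-mean vector of A."""
    m, n = len(A), len(A[0])
    c = [sum(A[i][j] for i in range(m)) / m for j in range(n)]
    return [[(1 - alpha) * A[i][j] + alpha * c[j] for j in range(n)]
            for i in range(m)]


def to_fair_strategy(p_tilde, alpha):
    """Map a point of the simplex to the fair strategy (alpha/m) 1 + (1 - alpha) p_tilde."""
    m = len(p_tilde)
    return [alpha / m + (1 - alpha) * x for x in p_tilde]


def bilinear(p, A, q):
    """Compute the bilinear payoff p^T A q."""
    return sum(p[i] * A[i][j] * q[j]
               for i in range(len(p)) for j in range(len(q)))


def random_simplex_point(k, rng):
    """Draw a random probability vector of length k (hypothetical helper)."""
    w = [rng.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]


# Assumed toy instance: 3 row actions, 4 column actions, alpha = 0.3.
rng = random.Random(0)
m, n, alpha = 3, 4, 0.3
A = [[rng.random() for _ in range(n)] for _ in range(m)]

p_tilde = random_simplex_point(m, rng)
q = random_simplex_point(n, rng)

p = to_fair_strategy(p_tilde, alpha)      # fair strategy on the original game
A_fair = fair_payoff_matrix(A, alpha)     # fair payoff matrix

lhs = bilinear(p, A, q)                   # p^T A q
rhs = bilinear(p_tilde, A_fair, q)        # p_tilde^T A_fair q
gap = abs(lhs - rhs)
print("identity gap:", gap)
```

The two sides agree up to floating-point error because $\widetilde{p}^{\top}\mathbf{1} = 1$, so the term $α\,\widetilde{p}^{\top}\mathbf{1}c^{\top}q$ collapses to $α\,c^{\top}q = (α/m)\mathbf{1}^{\top}Aq$. The sketch also makes the fairness floor visible: every entry of `p` is at least $α/m$ by construction.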
