Payoff-based learning with matrix multiplicative weights in quantum games

In this paper, we study the problem of learning in quantum games (and other classes of semidefinite games) with scalar, payoff-based feedback. For concreteness, we focus on the widely used matrix multiplicative weights (MMW) algorithm and, instead of requiring players to have full knowledge of the game (and/or each other's chosen states), we introduce a suite of minimal-information matrix multiplicative weights (3MW) methods tailored to different information frameworks. The main difficulty in attaining convergence in this setting is that, in contrast to classical finite games, quantum games have a continuum of pure states (the quantum equivalent of pure strategies), so standard importance-weighting techniques for estimating payoff vectors cannot be employed. Instead, we borrow ideas from bandit convex optimization and design a zeroth-order gradient sampler adapted to the semidefinite geometry of the problem at hand. As a first result, we show that the 3MW method with deterministic payoff feedback retains the $\mathcal{O}(1/\sqrt{T})$ convergence rate of the vanilla, full-information MMW algorithm in quantum min-max games, even though the players observe only a single scalar. Subsequently, we relax the algorithm's information requirements even further and provide a 3MW method that only requires players to observe a random realization of their payoff observable, and that converges to equilibrium at an $\mathcal{O}(T^{-1/4})$ rate. Finally, going beyond zero-sum games, we show that a regularized variant of the proposed 3MW method guarantees local convergence with high probability to all equilibria that satisfy a certain first-order stability condition.

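For orientation, the full-information MMW baseline that the abstract compares against can be sketched as follows. This is a minimal illustration, not the 3MW method: it assumes a two-player zero-sum game with payoff $u(X_1, X_2) = \operatorname{tr}[W (X_1 \otimes X_2)]$ for a Hermitian matrix $W$, and it gives each player exact access to their payoff gradient, which is precisely the information the payoff-based setting withholds. The function names, the random payoff matrix, and the values of `T` and `eta` are hypothetical choices made for this example.

```python
import numpy as np

def random_hermitian(n, seed=0):
    """Hypothetical payoff matrix: random Hermitian, scaled to spectral norm 1."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    W = (A + A.conj().T) / 2
    return W / np.linalg.norm(W, 2)

def gibbs_state(Y):
    """Map a Hermitian score matrix Y to the density matrix exp(Y) / tr exp(Y)."""
    Y = (Y + Y.conj().T) / 2           # guard against round-off asymmetry
    w, U = np.linalg.eigh(Y)
    e = np.exp(w - w.max())            # shift eigenvalues for numerical stability
    return (U * e) @ U.conj().T / e.sum()

def payoff_gradients(W4, X1, X2):
    """Gradients V1, V2 with u = tr(V1 X1) = tr(V2 X2) (partial traces of W)."""
    V1 = np.einsum("ijkl,lj->ik", W4, X2)   # depends on X2 only
    V2 = np.einsum("ijkl,ki->jl", W4, X1)   # depends on X1 only
    return V1, V2

def run_mmw(W, d1, d2, T=2000, eta=0.05):
    """Full-information MMW; returns the time-averaged states of both players."""
    W4 = W.reshape(d1, d2, d1, d2)
    Y1 = np.zeros((d1, d1), dtype=complex)
    Y2 = np.zeros((d2, d2), dtype=complex)
    avg1, avg2 = np.zeros_like(Y1), np.zeros_like(Y2)
    for _ in range(T):
        X1, X2 = gibbs_state(Y1), gibbs_state(Y2)
        V1, V2 = payoff_gradients(W4, X1, X2)
        Y1 += eta * V1                 # player 1 maximizes u
        Y2 -= eta * V2                 # player 2 minimizes u
        avg1 += X1 / T
        avg2 += X2 / T
    return avg1, avg2

def duality_gap(W, d1, d2, X1, X2):
    """max over X1' of u(X1', X2) minus min over X2' of u(X1, X2')."""
    V1, V2 = payoff_gradients(W.reshape(d1, d2, d1, d2), X1, X2)
    return np.linalg.eigvalsh(V1)[-1] - np.linalg.eigvalsh(V2)[0]

if __name__ == "__main__":
    W = random_hermitian(4)
    X1_avg, X2_avg = run_mmw(W, 2, 2)
    print("duality gap of averaged states:", duality_gap(W, 2, 2, X1_avg, X2_avg))
```

The duality gap of the time-averaged states is the quantity to which the $\mathcal{O}(1/\sqrt{T})$ rate refers in the min-max setting. In the payoff-based setting, the exact gradients `V1` and `V2` would have to be replaced by estimates built from observed scalar payoffs; that estimator is not reproduced here.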