Consider a game of $N$ players, each with a $d$-dimensional action set. Each player's utility function consists of their reward function plus a linear term in each action dimension, with coefficients controlled by a manager. We assume that the game is strongly monotone, so that if each player runs gradient descent, the dynamics converge to a unique Nash equilibrium (NE). The NE is typically inefficient in terms of global performance, which can be improved by imposing $K$-dimensional linear constraints on the NE. We therefore want the manager to pick the controlled coefficients that impose the desired constraints on the NE. However, this requires knowing the players' reward functions and action sets; obtaining this game structure information is infeasible in a large-scale network and violates the users' privacy. To overcome this, we propose a simple algorithm that learns to shift the NE of the game to meet the linear constraints by adjusting the controlled coefficients online. Our algorithm requires only the violation of the linear constraints as feedback and needs neither the reward functions nor the action sets. We prove that our algorithm, which is based on two time-scale stochastic approximation, converges with probability 1 to the set of NE that satisfy the target linear constraints. We then establish a mean-square convergence rate of $O(t^{-1/4})$ for our algorithm. This is the first such bound for two time-scale stochastic approximation in which the slower time-scale is a fixed-point iteration with a non-expansive mapping. We demonstrate how our scheme can be applied to optimizing a global quadratic cost at the NE and to load balancing in resource allocation games, and provide simulations of our algorithm for both scenarios.
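To make the two time-scale scheme concrete, here is a minimal numerical sketch. All modeling choices below are illustrative assumptions, not taken from the paper: the decoupled quadratic rewards $r_i(x_i) = -\tfrac{1}{2} x_i^\top Q_i x_i + q_i^\top x_i$ (positive-definite $Q_i$ makes the game strongly monotone), the dual-pricing parameterization of the controlled coefficients $c = -A^\top \lambda$, and the specific step-size exponents. The fast loop is the players' gradient ascent on their utilities; the slow loop updates the manager's variables using only the (noisy) constraint violation $Ax - b$.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Illustrative setup (values and reward model are assumptions, not the paper's) ---
N, d, K = 4, 3, 2                 # players, action dimension, number of linear constraints
dim = N * d

# Decoupled quadratic rewards r_i(x_i) = -0.5 * x_i' Q_i x_i + q_i' x_i.
# Positive-definite Q_i makes the game strongly monotone, so the players'
# gradient dynamics have a unique NE for every choice of coefficients.
Q = [np.diag(rng.uniform(1.0, 2.0, d)) for _ in range(N)]
q = [rng.normal(size=d) for _ in range(N)]

# Target K-dimensional linear constraint on the stacked NE: A x* = b.
A = rng.normal(size=(K, dim))
b = rng.normal(size=K)

x = np.zeros(dim)    # stacked player actions
lam = np.zeros(K)    # manager's internal variables; coefficients are c = -A' lam

for t in range(1, 200_001):
    alpha = t ** -0.6    # fast step size (players); exponents are illustrative
    beta = t ** -0.9     # slow step size (manager); decays faster, so beta/alpha -> 0

    c = -A.T @ lam       # controlled linear coefficients, one d-dimensional block per player

    # Fast time-scale: each player ascends the gradient of its own utility
    # u_i(x) = r_i(x_i) + c_i' x_i, using only local information.
    for i in range(N):
        s = slice(i * d, (i + 1) * d)
        x[s] += alpha * (-Q[i] @ x[s] + q[i] + c[s])

    # Slow time-scale: the manager observes only a noisy constraint violation
    # and nudges its variables; no reward functions or action sets are needed.
    violation = A @ x - b + 0.01 * rng.normal(size=K)
    lam += beta * violation

print("constraint violation at the learned NE:", np.linalg.norm(A @ x - b))
```

Because $\beta_t/\alpha_t \to 0$, the inner game effectively tracks its NE for the current coefficients while the outer iteration steers that NE toward the constraint set, mirroring the time-scale separation in the convergence analysis above.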