Scalable and Independent Learning of Nash Equilibrium Policies in $n$-Player Stochastic Games with Unknown Independent Chains

We study a subclass of $n$-player stochastic games: stochastic games with independent chains and unknown transition matrices. In this class of games, each player controls its own internal Markov chain, whose transitions do not depend on the states or actions of the other players. The players' decisions are nevertheless coupled through their payoff functions. We assume that players receive only realizations of their payoffs: they cannot observe the states and actions of other players, nor do they know the transition probability matrices of their own Markov chains. Relying on a compact dual formulation of the game based on occupancy measures, and on confidence sets that maintain high-probability estimates of the unknown transition matrices, we propose a fully decentralized mirror descent algorithm to learn an $\epsilon$-Nash equilibrium ($\epsilon$-NE) for this class of games. The proposed algorithm has the desirable properties of independence, scalability, and convergence. Specifically, under no assumptions on the reward functions, we show that the proposed algorithm converges in polynomial time, in a weaker distance (namely, the averaged Nikaido-Isoda gap), to the set of $\epsilon$-NE policies with arbitrarily high probability. Moreover, assuming the existence of a variationally stable Nash equilibrium policy, we show that the proposed algorithm converges asymptotically to the stable $\epsilon$-NE policy with arbitrarily high probability. Alongside Markov potential games and linear-quadratic stochastic games, this work thus provides another subclass of $n$-player stochastic games that, under some mild assumptions, admits polynomial-time learning algorithms for finding stationary $\epsilon$-NE policies.
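To make the occupancy-measure viewpoint concrete, the following is a minimal sketch for a single player's chain, not the algorithm proposed in this work. It assumes the transition tensor `P` is known and the induced chain is ergodic, whereas the setting above has unknown transitions handled through confidence sets. The function names and array layout are hypothetical choices for illustration. The sketch shows how a stationary policy induces an occupancy measure over state-action pairs and how the policy can be recovered from that measure.

```python
import numpy as np


def occupancy_from_policy(P, pi, iters=2000):
    """Stationary state-action occupancy measure of policy `pi`.

    P  : array (S, A, S), P[s, a, t] = probability of moving from s to t under a
         (assumed known here, purely for illustration).
    pi : array (S, A), pi[s, a] = probability of action a in state s.
    Assumes the chain induced by `pi` is ergodic, so power iteration converges.
    """
    S = P.shape[0]
    # State-to-state transition matrix under the policy.
    P_pi = np.einsum("sa,sat->st", pi, P)
    d = np.full(S, 1.0 / S)
    for _ in range(iters):
        d = d @ P_pi  # push the state distribution one step forward
    # Joint stationary mass on (state, action) pairs.
    return d[:, None] * pi


def policy_from_occupancy(rho):
    """Recover a stationary policy from an occupancy measure `rho` of shape (S, A).

    States with zero mass get a uniform action distribution (an arbitrary choice).
    """
    rho = np.asarray(rho, dtype=float)
    state_mass = rho.sum(axis=1, keepdims=True)
    visited = state_mass > 0
    uniform = np.full_like(rho, 1.0 / rho.shape[1])
    safe_mass = np.where(visited, state_mass, 1.0)  # avoid division by zero
    return np.where(visited, rho / safe_mass, uniform)
```

In this representation a player's expected payoff is linear in its own occupancy measure, which is what makes a dual formulation over occupancy measures convenient for mirror-descent-type updates.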
