We consider a subclass of $n$-player stochastic games in which each player has its own internal state and action spaces, and the players are coupled only through their payoff functions. The players' internal chains are assumed to be driven by independent transition probabilities. Moreover, players observe only realizations of their payoffs, not the payoff functions themselves, and cannot observe each other's states or actions. For this class of games, we first show that finding a stationary Nash equilibrium (NE) policy without any assumption on the reward functions is intractable. However, for general reward functions, we develop polynomial-time learning algorithms based on dual averaging and dual mirror descent, which converge in terms of the averaged Nikaido-Isoda distance to the set of $\epsilon$-NE policies almost surely or in expectation. In particular, under extra assumptions on the reward functions such as social concavity, we derive polynomial upper bounds on the number of iterates needed to achieve an $\epsilon$-NE policy with high probability. Finally, we evaluate the effectiveness of the proposed algorithms in learning $\epsilon$-NE policies using numerical experiments for energy management in smart grids.
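To make the two central notions above concrete, the following is a minimal sketch, not the algorithm analyzed in this work. It assumes a static two-player zero-sum matrix game rather than a stochastic game with internal chains, and it feeds each player exact expected payoff vectors rather than noisy payoff realizations. Under those assumptions, dual averaging with an entropic regularizer reduces to a softmax of the scaled cumulative payoff vectors, and the Nikaido-Isoda gap is the sum of the players' best-response improvements, which is zero exactly at an NE. The function names (`payoff_vectors`, `ni_gap`, `dual_averaging`), the step size `eta`, and the example matrix are all hypothetical choices made for illustration.

```python
import math


def payoff_vectors(A, x, y):
    """Per-action expected payoffs when player 1 receives x^T A y and player 2 receives -x^T A y."""
    g1 = [sum(A[i][j] * y[j] for j in range(len(y))) for i in range(len(x))]
    g2 = [-sum(A[i][j] * x[i] for i in range(len(x))) for j in range(len(y))]
    return g1, g2


def ni_gap(A, x, y):
    """Nikaido-Isoda gap: sum over players of the gain from a unilateral best response."""
    g1, g2 = payoff_vectors(A, x, y)
    u1 = sum(xi * g for xi, g in zip(x, g1))  # player 1's current expected payoff
    u2 = sum(yj * g for yj, g in zip(y, g2))  # player 2's current expected payoff
    return (max(g1) - u1) + (max(g2) - u2)


def softmax(scores):
    """Numerically stable softmax, i.e. the entropic mirror map onto the simplex."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def dual_averaging(A, steps=5000, eta=0.02):
    """Run entropic dual averaging for both players and return time-averaged strategies."""
    n, m = len(A), len(A[0])
    z1, z2 = [0.0] * n, [0.0] * m          # cumulative payoff vectors (dual variables)
    avg_x, avg_y = [0.0] * n, [0.0] * m    # running averages of the played strategies
    for t in range(1, steps + 1):
        x = softmax([eta * z for z in z1])
        y = softmax([eta * z for z in z2])
        g1, g2 = payoff_vectors(A, x, y)
        z1 = [z + g for z, g in zip(z1, g1)]
        z2 = [z + g for z, g in zip(z2, g2)]
        avg_x = [a + (xi - a) / t for a, xi in zip(avg_x, x)]
        avg_y = [a + (yj - a) / t for a, yj in zip(avg_y, y)]
    return avg_x, avg_y


if __name__ == "__main__":
    A = [[2.0, -1.0], [-1.0, 1.0]]  # illustrative payoff matrix, not taken from the paper
    x_bar, y_bar = dual_averaging(A)
    print("averaged strategies:", x_bar, y_bar)
    print("Nikaido-Isoda gap:", ni_gap(A, x_bar, y_bar))
```

A profile whose gap is at most $\epsilon$ is an $\epsilon$-NE, since no player can gain more than $\epsilon$ by deviating unilaterally. The sketch evaluates the gap at the time-averaged strategies, which is a simplification suited to the zero-sum case and may differ from the averaged Nikaido-Isoda distance used in the paper.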