Despite their significant potential for a variety of applications, stochastic games with long-run average payoffs have received limited scholarly attention, particularly concerning the development of learning algorithms, owing to the challenges of their mathematical analysis. In this paper, we study stochastic games with long-run average payoffs and derive an equivalent formulation of the individual payoff gradients by defining advantage functions, which we prove to be bounded. This result allows us to show that each individual payoff gradient is Lipschitz continuous with respect to the policy profile and that the value functions of the games exhibit the gradient dominance property. Leveraging these insights, we devise a payoff-based gradient estimation approach and combine it with the Regularized Robbins-Monro method from stochastic approximation theory to construct a bandit learning algorithm suited to stochastic games with long-run average payoffs. We further prove that if all players adopt our algorithm, the policy profile converges asymptotically to a Nash equilibrium with probability one, provided that all Nash equilibria are globally neutrally stable and a globally variationally stable Nash equilibrium exists. This condition covers a broad class of games, including monotone games.
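To illustrate the flavor of the two ingredients named above, the following is a minimal single-player sketch, not the paper's algorithm: it combines a payoff-based (bandit, single-query) gradient estimate with a Robbins-Monro iteration, i.e., diminishing step sizes $a_n$ satisfying $\sum a_n = \infty$ and $\sum a_n^2 < \infty$, on a hypothetical concave toy payoff. All function names, step-size schedules, and constants are illustrative assumptions.

```python
import numpy as np

def single_point_gradient(payoff, x, delta, rng):
    # Bandit (payoff-based) gradient estimate from one payoff query:
    # E[(d / delta) * f(x + delta * u) * u] approximates the gradient of a
    # smoothed version of f, where u is uniform on the unit sphere.
    d = x.size
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)
    return (d / delta) * payoff(x + delta * u) * u

def robbins_monro_ascent(payoff, x0, n_iters=20000, seed=0):
    # Robbins-Monro iteration x_{n+1} = x_n + a_n * g_n with step sizes
    # a_n satisfying sum(a_n) = inf and sum(a_n^2) < inf, together with a
    # vanishing exploration radius delta_n for the gradient estimate.
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for n in range(1, n_iters + 1):
        a_n = 0.5 / (n + 10)       # Robbins-Monro step sizes (assumed schedule)
        delta_n = 0.5 / n ** 0.25  # shrinking query radius (assumed schedule)
        x = x + a_n * single_point_gradient(payoff, x, delta_n, rng)
    return x

if __name__ == "__main__":
    # Hypothetical concave payoff with a unique maximizer at `target`.
    target = np.array([0.6, -0.3])
    payoff = lambda x: -np.sum((x - target) ** 2)
    x_final = robbins_monro_ascent(payoff, np.zeros(2))
    print(np.linalg.norm(x_final - target))  # small: iterates approach target
```

The single-query estimator is what makes the scheme "bandit": the learner observes only realized payoffs, never gradients, which matches the information structure assumed in the abstract.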