We study a multi-agent reinforcement learning dynamics and analyze its asymptotic behavior in infinite-horizon discounted Markov potential games. We focus on the independent and decentralized setting, where players do not know the game parameters and cannot communicate or coordinate. In each stage, players asynchronously update their estimates of the Q-function, which evaluates their total contingent payoff, based on the realized one-stage reward. Then, players independently update their policies by incorporating an optimal one-stage deviation strategy based on the estimated Q-function. Inspired by the actor-critic algorithm in single-agent reinforcement learning, a key feature of our learning dynamics is that agents update their Q-function estimates at a faster timescale than their policies. Leveraging tools from two-timescale asynchronous stochastic approximation theory, we characterize the convergent set of the learning dynamics.
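To make the two-timescale structure concrete, the following is a minimal sketch of a single player's per-stage update, assuming tabular estimates and a hard best-response actor step; the function name, array layout, and step-size arguments (`alpha` for the fast critic step, `beta` for the slow actor step) are illustrative assumptions, not the exact dynamics analyzed in the paper.

```python
import numpy as np

def one_stage_update(Q, pi, s, a, r, s_next, gamma, alpha, beta):
    """One player's update at a single stage (tabular, illustrative).

    Q:  (num_states, num_actions) array -- this player's Q-function estimate.
    pi: (num_states, num_actions) array -- this player's policy (rows sum to 1).
    alpha, beta: fast (critic) and slow (actor) step sizes, with beta << alpha.
    """
    # Critic step (fast timescale): asynchronous TD update of only the
    # visited entry (s, a), using the realized one-stage reward r.
    v_next = pi[s_next] @ Q[s_next]        # expected continuation value
    Q[s, a] += alpha * (r + gamma * v_next - Q[s, a])

    # Actor step (slow timescale): shift the policy at state s toward the
    # optimal one-stage deviation implied by the current Q-estimate.
    best_response = np.zeros_like(pi[s])
    best_response[np.argmax(Q[s])] = 1.0
    pi[s] += beta * (best_response - pi[s])  # remains a probability vector
    return Q, pi
```

A typical two-timescale choice would take step sizes such as $\alpha_t = t^{-0.6}$ and $\beta_t = t^{-1}$, so that $\beta_t/\alpha_t \to 0$ and the Q-estimates track the slowly moving policies.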