We present a new family of logit-Q dynamics for efficient learning in stochastic games by combining log-linear learning (also known as logit dynamics) for the repeated play of normal-form games with Q-learning for unknown Markov decision processes within the auxiliary stage-game framework. In this framework, we view stochastic games as agents repeatedly playing a stage game associated with the current state of the underlying game, while the agents' Q-functions determine the payoffs of these stage games. We show that the proposed logit-Q dynamics reach a (near) efficient equilibrium in stochastic teams with unknown dynamics, and we quantify the approximation error. We also establish the rationality of the logit-Q dynamics against agents following pure stationary strategies, and their convergence in stochastic games beyond stochastic teams, where the stage payoffs induce potential games but only a single agent controls the state transitions. The key idea is to approximate the dynamics with a fictional scenario in which the Q-function estimates stay stationary over epochs whose lengths grow at a sufficiently slow rate. We then couple the dynamics in the main and fictional scenarios to show that the two scenarios grow increasingly similar across epochs, owing to the vanishing step size and the growing epoch lengths.
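To make the two coupled update rules concrete, the following display sketches one schematic logit-Q iteration under assumed notation: $\beta > 0$ is a rationality (inverse-temperature) parameter of the log-linear revision, $\alpha_k$ is the Q-learning step size, and $v^i_k$ stands in for agent $i$'s stage-game value estimate, whose exact definition is given in the body of the paper.
\[
\Pr\{a^i_k = a^i\} \;=\; \frac{\exp\!\big(\beta\, Q^i_k(s_k, a^i, a^{-i}_k)\big)}{\sum_{\tilde a^i} \exp\!\big(\beta\, Q^i_k(s_k, \tilde a^i, a^{-i}_k)\big)},
\qquad
Q^i_{k+1}(s_k, a_k) \;=\; Q^i_k(s_k, a_k) + \alpha_k\big(r^i_k + \gamma\, v^i_k(s_{k+1}) - Q^i_k(s_k, a_k)\big).
\]
Here a revising agent selects its action by a softmax over the payoffs of the current auxiliary stage game, i.e., its Q-function estimate evaluated against the others' current joint action $a^{-i}_k$, and the Q-update is the standard temporal-difference recursion driven by the observed reward $r^i_k$ and next state $s_{k+1}$.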