We present two logit-Q learning dynamics that combine the classical and independent log-linear learning updates with an on-policy value iteration update for efficient learning in stochastic games. We show that these dynamics reach a (near) efficient equilibrium in stochastic teams, and we quantify a bound on the approximation error. We also show that the logit-Q dynamics are rational against agents following pure stationary strategies, and that they converge in stochastic games beyond stochastic teams in which the reward functions induce potential games and only a single agent controls the state transitions. The key idea is to approximate the dynamics with a fictional scenario, used only for the analysis, in which the Q-function estimates are stationary over finite-length epochs. We then couple the dynamics in the main and fictional scenarios to show that the two scenarios become increasingly similar across epochs due to the vanishing step size.
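As a rough illustration of the two ingredients named above, the following minimal sketch pairs a logit (softmax) response over Q-function estimates with a step-size-weighted update toward a one-step target. It is not the paper's algorithm: the function names, the temperature parameter, the discount factor, and the 1/(k+1) step-size schedule are all assumptions made for illustration.

```python
import math


def logit_response(q_values, temperature):
    """Softmax (logit) choice probabilities over actions.

    Assumption: actions with higher Q estimates are chosen with
    exponentially higher probability, scaled by `temperature`.
    """
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def q_update(q_old, reward, next_value, step_size, discount):
    """Move the Q estimate toward reward + discount * next_value."""
    target = reward + discount * next_value
    return q_old + step_size * (target - q_old)


def vanishing_step_size(k):
    """Hypothetical schedule: step size 1/(k+1) at iteration k."""
    return 1.0 / (k + 1)


probs = logit_response([1.0, 2.0], temperature=0.5)
q_new = q_update(0.0, reward=1.0, next_value=2.0,
                 step_size=vanishing_step_size(1), discount=0.9)
```

With a vanishing step size, later updates change the Q estimates less and less, which is the intuition behind treating the estimates as stationary within an epoch for the analysis.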