We present new efficient-Q learning dynamics for stochastic games, going beyond the recent concentration of progress on provable convergence to possibly inefficient equilibria. Agents follow log-linear learning dynamics in stage games whose payoffs are the Q-functions, and they estimate the Q-functions iteratively with a vanishing stepsize. This (implicitly) two-timescale dynamic makes the stage games relatively stationary for the log-linear update, so that the agents can track the efficient equilibrium of the stage games. We show that, in identical-interest stochastic games, the Q-function estimates converge almost surely to the Q-function associated with the efficient equilibrium, up to an approximation error induced by the softmax response in the log-linear update. The key idea is to approximate the dynamics with a fictional scenario in which the Q-function estimates are stationary over finite-length epochs. We then couple the dynamics in the main and fictional scenarios to show that the error of this approximation decays to zero due to the vanishing stepsize.
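The interplay between the log-linear update and the Q-function estimate can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes two agents with a common reward, a per-state memory of the last joint action, a stepsize of one over the visit count of each state-action pair, and a continuation value taken as the maximum of the current Q-estimate over joint actions. The names `softmax`, `efficient_q_sketch`, `reward`, `transition`, and `tau` are hypothetical.

```python
import math
import random


def softmax(values, tau):
    """Softmax response with temperature tau (shifted by the max for stability)."""
    m = max(values)
    weights = [math.exp((v - m) / tau) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def efficient_q_sketch(reward, transition, gamma=0.9, tau=0.05, steps=5000, seed=0):
    """Illustrative two-agent, identical-interest dynamics (assumptions in the lead-in).

    reward[s][a0][a1]        : common stage reward
    transition[s][a0][a1][t] : probability of moving from state s to state t
    """
    rng = random.Random(seed)
    n_states = len(reward)
    n_a = [len(reward[0]), len(reward[0][0])]
    Q = [[[0.0] * n_a[1] for _ in range(n_a[0])] for _ in range(n_states)]
    visits = [[[0] * n_a[1] for _ in range(n_a[0])] for _ in range(n_states)]
    last = [[0, 0] for _ in range(n_states)]  # assumed: last joint action per state
    s = 0
    for _ in range(steps):
        a = last[s]  # alias: the revision below updates the stored joint action
        i = rng.randrange(2)  # log-linear learning: one random agent revises
        if i == 0:
            payoffs = [Q[s][b][a[1]] for b in range(n_a[0])]
        else:
            payoffs = list(Q[s][a[0]])
        a[i] = rng.choices(range(n_a[i]), weights=softmax(payoffs, tau))[0]
        a0, a1 = a
        s_next = rng.choices(range(n_states), weights=transition[s][a0][a1])[0]
        # Vanishing stepsize: Q changes slowly relative to the action revisions.
        visits[s][a0][a1] += 1
        alpha = 1.0 / visits[s][a0][a1]
        # Assumed continuation value: best joint action under the current estimate.
        v_next = max(max(row) for row in Q[s_next])
        target = reward[s][a0][a1] + gamma * v_next
        Q[s][a0][a1] += alpha * (target - Q[s][a0][a1])
        s = s_next
    return Q
```

A smaller `tau` makes the softmax response closer to a best response, which in this sketch trades exploration of joint actions against the size of the softmax-induced approximation error mentioned above.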