In this article we analyze a partial-information Nash Q-learning algorithm for a general 2-player stochastic game. Partial information refers to the setting where a player knows neither the strategy of the opposing player nor the actions that player takes. We prove convergence of this partially informed algorithm for general 2-player games with finitely many states and actions, and we confirm that the limiting strategy is in fact a full-information Nash equilibrium. In implementation, partial information offers simplicity because it avoids computing Nash equilibria at every time step. In contrast, full-information Q-learning uses the Lemke-Howson algorithm to compute Nash equilibria at every time step, which can be an effective approach but requires several assumptions to prove convergence and may fail with a runtime error if Lemke-Howson encounters degeneracy. In simulations, the partial-information results we obtain are comparable to those for full-information Q-learning and fictitious play.