The problem of two-player zero-sum Markov games has recently attracted increasing interest in theoretical studies of multi-agent reinforcement learning (RL). In particular, for finite-horizon episodic Markov decision processes (MDPs), it has been shown that model-based algorithms can find an $\epsilon$-optimal Nash equilibrium (NE) with a sample complexity of $O(H^3SAB/\epsilon^2)$, which is optimal in its dependence on the horizon $H$ and the number of states $S$ (where $A$ and $B$ denote the numbers of actions of the two players, respectively). However, none of the existing model-free algorithms achieves such optimality. In this work, we propose a model-free stage-based Q-learning algorithm and show that it achieves the same sample complexity as the best model-based algorithm, and hence demonstrate for the first time that model-free algorithms can enjoy the same optimality in the $H$ dependence as model-based algorithms. The improvement in the dependence on $H$ arises mainly from leveraging the popular variance reduction technique based on the reference-advantage decomposition, previously used only for single-agent RL. However, this technique relies on a critical monotonicity property of the value function, which does not hold in Markov games because the policy is updated via the coarse correlated equilibrium (CCE) oracle. Thus, to extend the technique to Markov games, our algorithm features a key novel design: it updates the reference value functions to the pair of optimistic and pessimistic value functions whose value difference is the smallest in the history so far, which yields the desired improvement in sample efficiency.
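The reference update described above (keep the pair of optimistic and pessimistic value functions whose difference is the smallest seen so far) can be illustrated with a minimal sketch for a single state at a single step. This is only an illustration of that selection rule under stated assumptions, not the algorithm itself: the names `update_reference`, `ref`, and `history`, the initial bounds $[0, H]$, and the numeric values are all hypothetical, and the CCE oracle, the bonus terms, and the stage-based schedule are omitted.

```python
from typing import List, Tuple

# (optimistic value, pessimistic value, gap between them); hypothetical layout.
Reference = Tuple[float, float, float]


def update_reference(ref: Reference, v_upper: float, v_lower: float) -> Reference:
    """Return the (optimistic, pessimistic) pair with the smallest gap seen so far.

    The current estimates replace the stored reference only when their gap
    is strictly smaller than the gap of the stored reference.
    """
    new_gap = v_upper - v_lower
    if new_gap < ref[2]:
        return (v_upper, v_lower, new_gap)
    return ref


# Assumed setup: values lie in [0, H], so the initial reference is the trivial pair.
H = 5.0
ref: Reference = (H, 0.0, H)

# Hypothetical sequence of (optimistic, pessimistic) estimates for one state.
# The gap is not monotone over time, which is why the minimum is tracked explicitly.
history: List[Tuple[float, float]] = [(4.0, 1.0), (4.5, 0.5), (3.0, 2.5), (3.5, 1.0)]
for v_up, v_lo in history:
    ref = update_reference(ref, v_up, v_lo)

print(ref)
```

In this toy sequence the latest estimates, (3.5, 1.0), have a larger gap than the earlier pair (3.0, 2.5), so the stored reference stays at the earlier pair. Tracking the minimum explicitly is what the sketch is meant to show: the estimates themselves need not tighten monotonically.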