Many model-based reinforcement learning (RL) algorithms can be viewed as having two phases that are iteratively implemented: a learning phase, where the model is approximately learned, and a planning phase, where the learned model is used to derive a policy. In the case of standard MDPs, the planning problem can be solved using either value iteration or policy iteration. However, in the case of zero-sum Markov games, there is no efficient policy iteration algorithm; e.g., Hansen et al. (2013) show that one has to solve Ω(1/(1-α)) MDPs, where α is the discount factor, to implement the only known convergent version of policy iteration. Another algorithm for zero-sum Markov games, called naive policy iteration, is easy to implement but is only provably convergent under very restrictive assumptions. Prior attempts to fix the naive policy iteration algorithm have several limitations. Here, we show that a simple variant of naive policy iteration for games converges, and does so exponentially fast. The only addition we propose to naive policy iteration is the use of lookahead in the policy improvement phase. This is appealing because lookahead is already widely used in RL for games. We further show that lookahead can be implemented efficiently in linear Markov games, which are the counterpart of linear MDPs and have been the subject of much recent attention. We then consider a multi-agent reinforcement learning algorithm that uses our algorithm in its planning phases, and provide sample and time complexity bounds for it.
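To make the idea of lookahead in the policy improvement phase concrete, the following is a minimal sketch and not the paper's exact algorithm. It assumes a tabular zero-sum Markov game with two actions per player, so that each stage game is a 2x2 matrix game with a closed-form solution. The lookahead depth `H`, the number of evaluation steps `m`, the truncated (m-step) policy evaluation, the toy game `R`, `P`, and all function names are illustrative assumptions that do not come from the abstract.

```python
from typing import List, Tuple

def solve_2x2(M: List[List[float]]) -> Tuple[float, float, float]:
    """Solve a 2x2 zero-sum matrix game (row player maximizes).
    Returns (value, p, q) with p = P(row plays 0), q = P(column plays 0)."""
    (a, b), (c, d) = M
    row_mins = [min(a, b), min(c, d)]
    col_maxs = [max(a, c), max(b, d)]
    maximin, minimax = max(row_mins), min(col_maxs)
    if maximin >= minimax:  # a pure-strategy saddle point exists
        p = 1.0 if row_mins[0] >= row_mins[1] else 0.0
        q = 1.0 if col_maxs[0] <= col_maxs[1] else 0.0
        return maximin, p, q
    denom = a - b - c + d  # nonzero whenever there is no saddle point
    return (a * d - b * c) / denom, (d - c) / denom, (d - b) / denom

def q_matrix(R, P, gamma, V, s):
    """Stage game at state s: reward plus discounted expected next value."""
    n = len(V)
    return [[R[s][a][b] + gamma * sum(P[s][a][b][t] * V[t] for t in range(n))
             for b in range(2)] for a in range(2)]

def shapley(R, P, gamma, V):
    """One application of the minimax Bellman (Shapley) operator."""
    return [solve_2x2(q_matrix(R, P, gamma, V, s))[0] for s in range(len(V))]

def evaluate_step(R, P, gamma, V, mu, nu):
    """One Bellman backup under the fixed joint policy (mu, nu)."""
    out = []
    for s in range(len(V)):
        Q = q_matrix(R, P, gamma, V, s)
        p, q = mu[s], nu[s]
        out.append(p * q * Q[0][0] + p * (1 - q) * Q[0][1]
                   + (1 - p) * q * Q[1][0] + (1 - p) * (1 - q) * Q[1][1])
    return out

def lookahead_policy_iteration(R, P, gamma, H, m, iters):
    """Naive policy iteration with H-step lookahead (illustrative variant)."""
    n = len(R)
    V = [0.0] * n
    mu, nu = [0.5] * n, [0.5] * n
    for _ in range(iters):
        W = V
        for _ in range(H - 1):  # lookahead: apply the minimax operator H-1 times
            W = shapley(R, P, gamma, W)
        # Improvement: both players play the stage-game equilibrium w.r.t. W.
        sols = [solve_2x2(q_matrix(R, P, gamma, W, s)) for s in range(n)]
        mu, nu = [x[1] for x in sols], [x[2] for x in sols]
        V = W
        for _ in range(m):  # truncated evaluation of the joint policy
            V = evaluate_step(R, P, gamma, V, mu, nu)
    return V, mu, nu

def value_iteration(R, P, gamma, iters):
    """Baseline: repeated application of the Shapley operator."""
    V = [0.0] * len(R)
    for _ in range(iters):
        V = shapley(R, P, gamma, V)
    return V

# Hypothetical two-state game: R[s][a][b] is the reward to the row player,
# P[s][a][b][t] is the probability of moving from state s to state t.
R = [[[1.0, -1.0], [-1.0, 1.0]], [[0.0, 2.0], [3.0, 1.0]]]
P = [[[[0.5, 0.5], [1.0, 0.0]], [[0.0, 1.0], [0.5, 0.5]]],
     [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]]]
GAMMA = 0.5

V_pi, mu, nu = lookahead_policy_iteration(R, P, GAMMA, H=4, m=3, iters=40)
V_vi = value_iteration(R, P, GAMMA, iters=200)
print("lookahead policy iteration:", V_pi)
print("value iteration baseline:  ", V_vi)
```

With `H=1` the loop reduces to a naive policy iteration with truncated evaluation, which is the setting where convergence is not guaranteed in general. The discount factor and lookahead depth here were chosen so that the error visibly contracts on this toy game. They are not the conditions established in the paper.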