A New Policy Iteration Algorithm For Reinforcement Learning in Zero-Sum Markov Games

Many model-based reinforcement learning (RL) algorithms can be viewed as iterating between two phases: a learning phase, in which the model is approximately learned, and a planning phase, in which the learned model is used to derive a policy. In the case of standard MDPs, the planning problem can be solved using either value iteration or policy iteration. However, in the case of zero-sum Markov games, there is no efficient policy iteration algorithm; e.g., Hansen et al. (2013) showed that one has to solve Omega(1/(1-alpha)) MDPs, where alpha is the discount factor, to implement the only known convergent version of policy iteration. Another algorithm for zero-sum Markov games, called naive policy iteration, is easy to implement but is provably convergent only under very restrictive assumptions. Prior attempts to fix the naive policy iteration algorithm have several limitations. Here, we show that a simple variant of naive policy iteration for games converges, and does so exponentially fast. The only addition we propose to naive policy iteration is the use of lookahead in the policy improvement phase. This is appealing because lookahead is often used in RL for games anyway. We further show that lookahead can be implemented efficiently in linear Markov games, which are the counterpart of linear MDPs and have been the subject of much attention recently. We then consider a multi-agent reinforcement learning algorithm that uses our algorithm in its planning phases, and provide sample and time complexity bounds for it.
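To make the idea of lookahead in the policy improvement phase concrete, the following is a minimal sketch on a toy tabular game. It is one plausible reading of the description above, not the algorithm as specified in the paper: the two-state game (`R`, `P0`), the lookahead depth `H`, the number of evaluation steps `M_EVAL`, and all function names are hypothetical choices made for illustration. Stage games are restricted to 2x2 so that they can be solved in closed form without a linear-programming library.

```python
ALPHA = 0.5   # discount factor (alpha in the abstract); value assumed
H = 4         # lookahead depth; assumed for illustration
M_EVAL = 3    # policy-evaluation steps per iteration; assumed

# Hypothetical 2-state game; both players have 2 actions in every state.
# R[s][a][b]: reward to the maximizer; P0[s][a][b]: probability of moving to state 0.
R = [[[1.0, -1.0], [-1.0, 1.0]],
     [[0.5, -0.5], [0.0, 1.0]]]
P0 = [[[0.9, 0.2], [0.4, 0.7]],
      [[0.1, 0.6], [0.8, 0.3]]]


def solve_2x2(M):
    """Return (value, x, y) for a 2x2 zero-sum matrix game; the row player maximizes."""
    (a, b), (c, d) = M
    maximin = max(min(a, b), min(c, d))
    minimax = min(max(a, c), max(b, d))
    if maximin >= minimax:  # a pure saddle point exists
        i = 0 if min(a, b) >= min(c, d) else 1
        j = 0 if max(a, c) <= max(b, d) else 1
        return maximin, [1 - i, i], [1 - j, j]
    denom = a + d - b - c   # nonzero whenever there is no saddle point
    p = (d - c) / denom     # probability that the row player picks action 0
    q = (d - b) / denom     # probability that the column player picks action 0
    return (a * d - b * c) / denom, [p, 1 - p], [q, 1 - q]


def q_matrix(V, s):
    """Stage game at state s: reward plus discounted expected continuation value."""
    return [[R[s][a][b] + ALPHA * (P0[s][a][b] * V[0] + (1 - P0[s][a][b]) * V[1])
             for b in range(2)] for a in range(2)]


def shapley(V):
    """Minimax Bellman operator: the value of the stage game in each state."""
    return [solve_2x2(q_matrix(V, s))[0] for s in range(2)]


def evaluate_once(V, mu, nu):
    """One backup under fixed mixed policies mu (maximizer) and nu (minimizer)."""
    out = []
    for s in range(2):
        Q = q_matrix(V, s)
        out.append(sum(mu[s][a] * nu[s][b] * Q[a][b]
                       for a in range(2) for b in range(2)))
    return out


def lookahead_policy_iteration(iterations=40):
    V = [0.0, 0.0]
    for _ in range(iterations):
        # Lookahead: apply the minimax operator H - 1 times before improving.
        W = V
        for _ in range(H - 1):
            W = shapley(W)
        # Policy improvement: equilibrium strategies of the stage games built from W.
        sols = [solve_2x2(q_matrix(W, s)) for s in range(2)]
        mu = [x for _, x, _ in sols]
        nu = [y for _, _, y in sols]
        # Partial policy evaluation: M_EVAL backups under the fixed policy pair.
        V = W
        for _ in range(M_EVAL):
            V = evaluate_once(V, mu, nu)
    return V, mu, nu


V, mu, nu = lookahead_policy_iteration()
residual = max(abs(v - tv) for v, tv in zip(V, shapley(V)))
print("values:", V, "Bellman residual:", residual)
```

With `H = 1` the lookahead loop is skipped and the sketch reduces to a naive policy iteration in which both players' policies are updated together. The small discount factor is chosen so that a shallow lookahead is already enough for the iterates in this toy example to contract toward the game value; larger discount factors would be expected to need deeper lookahead.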
