Optimal policies in standard MDPs can be obtained using either value iteration or policy iteration. In zero-sum Markov games, however, there is no efficient policy iteration algorithm: Hansen et al. (2013) show that implementing the only known convergent version of policy iteration requires solving Ω(1/(1-α)) MDPs, where α is the discount factor. Another algorithm for zero-sum Markov games, called naive policy iteration, is easy to implement but is provably convergent only under very restrictive assumptions. Prior attempts to fix the naive policy iteration algorithm have several limitations. Here, we show that a simple variant of naive policy iteration for games converges, and does so exponentially fast. The only addition we propose to naive policy iteration is the use of lookahead in the policy improvement phase. This is appealing because lookahead is already used in practical learning algorithms for games. We further show that lookahead can be implemented efficiently in linear Markov games, which are the counterpart of the much-studied linear MDPs. We illustrate the application of our new policy iteration algorithm by providing sample and time complexity bounds for policy-based reinforcement learning (RL) algorithms.
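The abstract does not state the update rule, so the following is only a rough sketch of what naive policy iteration with lookahead could look like on a tabular zero-sum Markov game. It rests on several assumptions that are not taken from the text: each player has exactly two actions (so each stage game can be solved in closed form instead of by linear programming), the lookahead is H-1 applications of the minimax Bellman operator, and policy evaluation is truncated to m applications of the fixed-policy operator. All function names, the parameters `H`, `m`, `iters`, and the example game `GAME` are hypothetical.

```python
def solve_2x2(M, tol=1e-12):
    """Solve a 2x2 zero-sum matrix game in which the row player maximizes.

    Returns (value, p, q): p and q are the probabilities with which the
    row and column player choose their action 0 at equilibrium.
    """
    (a, b), (c, d) = M
    row_mins = [min(a, b), min(c, d)]
    col_maxs = [max(a, c), max(b, d)]
    maximin, minimax = max(row_mins), min(col_maxs)
    if minimax - maximin <= tol:  # a pure saddle point exists
        p = 1.0 if row_mins[0] >= row_mins[1] else 0.0
        q = 1.0 if col_maxs[0] <= col_maxs[1] else 0.0
        return maximin, p, q
    # No saddle point: both players mix so the opponent is indifferent.
    denom = a + d - b - c
    return (a * d - b * c) / denom, (d - c) / denom, (d - b) / denom


def q_matrix(game, V, s):
    """Stage game at state s: reward plus discounted expected next value."""
    r, P, alpha = game
    return [[r[s][a][b] + alpha * sum(pr * v for pr, v in zip(P[s][a][b], V))
             for b in range(2)] for a in range(2)]


def shapley(game, V):
    """One application of the minimax Bellman operator."""
    return [solve_2x2(q_matrix(game, V, s))[0] for s in range(len(V))]


def greedy_policy(game, W):
    """Equilibrium (p, q) of the stage game at every state, given values W."""
    return [solve_2x2(q_matrix(game, W, s))[1:] for s in range(len(W))]


def apply_policy(game, V, policy):
    """One application of the Bellman operator of a fixed joint policy."""
    out = []
    for s, (p, q) in enumerate(policy):
        Q = q_matrix(game, V, s)
        mu, nu = (p, 1.0 - p), (q, 1.0 - q)
        out.append(sum(mu[a] * nu[b] * Q[a][b]
                       for a in range(2) for b in range(2)))
    return out


def lookahead_policy_iteration(game, H=3, m=5, iters=50):
    """Sketch: lookahead-based improvement, then m-step evaluation (assumed)."""
    V = [0.0] * len(game[0])
    policy = None
    for _ in range(iters):
        W = V
        for _ in range(H - 1):       # lookahead before improving the policy
            W = shapley(game, W)
        policy = greedy_policy(game, W)
        V = W
        for _ in range(m):           # truncated evaluation of the joint policy
            V = apply_policy(game, V, policy)
    return V, policy


# Hypothetical two-state game: r[s][a][b] is the reward to the maximizer,
# P[s][a][b] is the next-state distribution, and the discount factor is 0.5.
GAME = (
    [[[1.0, -1.0], [-1.0, 1.0]], [[3.0, 0.0], [1.0, 2.0]]],
    [[[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]],
     [[[0.5, 0.5], [1.0, 0.0]], [[0.0, 1.0], [0.5, 0.5]]]],
    0.5,
)

if __name__ == "__main__":
    values, joint_policy = lookahead_policy_iteration(GAME)
    print("values:", values)
    print("policy (p, q) per state:", joint_policy)
```

With `H=1` the lookahead loop is skipped and the sketch reduces to a naive scheme that improves directly against the current value estimate, which is the setting where the abstract says convergence is only guaranteed under restrictive assumptions. The choice of `H=3` for a discount factor of 0.5 is illustrative; how large the lookahead must be in general is a question for the paper's analysis, not something this sketch establishes.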