While policy optimization algorithms have played an important role in the recent empirical success of Reinforcement Learning (RL), the existing theoretical understanding of policy optimization remains rather limited: existing results are either restricted to tabular MDPs or suffer from highly suboptimal sample complexity, especially in online RL, where exploration is necessary. This paper proposes a simple and efficient policy optimization framework for online RL, Optimistic NPG. Optimistic NPG can be viewed as a simple combination of the classic natural policy gradient (NPG) algorithm [Kakade, 2001] with optimistic policy evaluation subroutines that encourage exploration. For $d$-dimensional linear MDPs, Optimistic NPG is computationally efficient and learns an $\varepsilon$-optimal policy within $\tilde{O}(d^2/\varepsilon^3)$ samples, making it the first computationally efficient algorithm whose sample complexity has the optimal dimension dependence $\tilde{\Theta}(d^2)$. It also improves over the state-of-the-art results for policy optimization algorithms [Zanette et al., 2021] by a factor of $d$. For general function approximation, which subsumes linear MDPs, Optimistic NPG is, to the best of our knowledge, also the first policy optimization algorithm that achieves polynomial sample complexity for learning near-optimal policies.
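To make the idea of pairing an NPG step with optimistic policy evaluation concrete, here is a minimal tabular sketch. It is an illustration under stated assumptions, not the algorithm analysed in the paper: the count-based bonus `beta / sqrt(n)`, the clipping level `q_max`, and the function names `optimistic_q` and `npg_softmax_update` are all hypothetical choices made for this example. The update itself is the multiplicative-weights form in which an NPG step is commonly written for tabular softmax policies, $\pi_{\text{new}}(a \mid s) \propto \pi(a \mid s)\exp(\eta\, Q(s,a))$.

```python
import numpy as np


def optimistic_q(q_hat, counts, beta, q_max):
    """Add an exploration bonus to a Q estimate and clip the result.

    The bonus beta / sqrt(n) is an illustrative tabular choice (an assumption),
    not the bonus used in the paper's linear MDP setting.
    """
    bonus = beta / np.sqrt(np.maximum(counts, 1))
    return np.minimum(q_hat + bonus, q_max)


def npg_softmax_update(policy, q_values, eta):
    """One multiplicative-weights step: pi_new(a|s) ∝ pi(a|s) * exp(eta * Q(s, a)).

    Assumes `policy` has shape (states, actions) with strictly positive entries.
    """
    logits = np.log(policy) + eta * q_values
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for stability
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum(axis=1, keepdims=True)


# Toy problem with 2 states and 2 actions (all numbers are made up).
policy = np.full((2, 2), 0.5)
q_hat = np.array([[1.0, 0.0], [0.0, 0.0]])   # estimated Q-values
counts = np.array([[10, 10], [100, 1]])      # visit counts per (state, action)

q_opt = optimistic_q(q_hat, counts, beta=1.0, q_max=2.0)
new_policy = npg_softmax_update(policy, q_opt, eta=1.0)
print(new_policy)
```

In state 0 both actions have the same count, so the update favours the action with the higher estimated value. In state 1 the estimates are tied, so the larger bonus on the rarely visited action shifts probability toward it, which is the exploration effect that optimism is meant to provide.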