Although policy-based reinforcement learning (RL) has achieved tremendous success in practice, it is significantly less understood in theory, especially when compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge this gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an ``optimistic version'' of the policy gradient direction. This paper proves that, for episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
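The abstract does not spell out what following an ``optimistic version'' of the policy gradient direction involves. As a rough illustration only, the sketch below shows one common way such a step can be written: a linear action-value estimate is inflated by an exploration bonus, and the policy is then reweighted multiplicatively toward actions with a high optimistic value. Everything here is an assumption made for illustration, not the paper's stated procedure: the function names `optimistic_q` and `policy_update`, the elliptical form of the bonus, the clipping to $[0, H]$, and the parameters `beta` (bonus scale) and `alpha` (step size).

```python
import numpy as np

def optimistic_q(phi, w, lambda_inv, beta, H):
    """Linear Q estimate plus an exploration bonus (illustrative assumption).

    phi:        (A, d) feature vector of each action at a fixed state.
    w:          (d,) regression weights, assumed to be given.
    lambda_inv: (d, d) inverse of a regularized feature covariance matrix.
    beta:       scale of the bonus; larger values mean more optimism.
    H:          episode horizon, used here as an upper clip on the value.
    """
    # Bonus for action a is beta * sqrt(phi_a^T lambda_inv phi_a): it is
    # large along feature directions that have been observed rarely.
    bonus = beta * np.sqrt(np.einsum("ad,de,ae->a", phi, lambda_inv, phi))
    return np.clip(phi @ w + bonus, 0.0, H)

def policy_update(pi_old, q, alpha):
    """Multiplicative update: pi_new proportional to pi_old * exp(alpha * q).

    Assumes every entry of pi_old is strictly positive.
    """
    logits = np.log(pi_old) + alpha * q
    logits -= logits.max()  # subtract the max for numerical stability
    pi_new = np.exp(logits)
    return pi_new / pi_new.sum()

# Two actions with equal predicted value; the first one is less explored.
phi = np.eye(2)
w = np.array([0.5, 0.5])
lambda_inv = np.diag([1.0, 0.01])
q = optimistic_q(phi, w, lambda_inv, beta=1.0, H=10.0)
pi = policy_update(np.array([0.5, 0.5]), q, alpha=1.0)
```

In this toy setting both actions have the same predicted value, so the updated policy prefers the first action purely because its bonus is larger. That is the sense in which such an update explores.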