The proximal policy optimization (PPO) algorithm is one of the most successful methods in reinforcement learning (RL). Despite this empirical success, the theoretical understanding of PPO remains limited. Specifically, it is unclear whether PPO or its optimistic variants can effectively solve linear Markov decision processes (MDPs), which are arguably the simplest models in RL with function approximation. To bridge this gap, we propose an optimistic variant of PPO for episodic adversarial linear MDPs with full-information feedback and establish a $\tilde{\mathcal{O}}(d^{3/4}H^2K^{3/4})$ regret bound for it. Here $d$ is the ambient dimension of the linear MDP, $H$ is the length of each episode, and $K$ is the number of episodes. Among existing policy-based algorithms, this is the state-of-the-art regret bound for both stochastic linear MDPs and adversarial linear MDPs with full-information feedback. Additionally, our algorithm design features a novel multi-batched updating mechanism, and the theoretical analysis uses a new covering number argument for the value and policy classes, both of which might be of independent interest.
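To make the bound concrete, the following is a sketch under the usual definition of regret in episodic adversarial MDPs. The notation $\pi^k$, $V_1^{\pi,k}$, and $x_1^k$ is an assumption for illustration and does not appear in the abstract: $\pi^k$ denotes the policy played in episode $k$, $V_1^{\pi,k}$ the value of policy $\pi$ under the reward of episode $k$, and $x_1^k$ the initial state of that episode.

$$
\mathrm{Regret}(K) \;=\; \max_{\pi}\ \sum_{k=1}^{K} \Big( V_1^{\pi,k}(x_1^k) - V_1^{\pi^k,k}(x_1^k) \Big).
$$

Under this definition, dividing the stated bound by $K$ shows that the average per-episode regret vanishes as $K$ grows:

$$
\frac{\mathrm{Regret}(K)}{K} \;=\; \tilde{\mathcal{O}}\!\left( d^{3/4} H^{2} K^{-1/4} \right) \;\longrightarrow\; 0 \quad \text{as } K \to \infty .
$$

Ignoring logarithmic factors and constants, reaching average regret at most $\varepsilon$ would therefore take on the order of $K \approx d^{3} H^{8} / \varepsilon^{4}$ episodes, obtained by solving $d^{3/4} H^{2} K^{-1/4} = \varepsilon$ for $K$.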