The proximal policy optimization (PPO) algorithm is one of the most successful methods in reinforcement learning (RL). Despite its success, the theoretical understanding of PPO remains limited. Specifically, it is unclear whether PPO or its optimistic variants can effectively solve linear Markov decision processes (MDPs), which are arguably the simplest models in RL with function approximation. To bridge this gap, we propose an optimistic variant of PPO for episodic adversarial linear MDPs with full-information feedback and establish a $\tilde{\mathcal{O}}(d^{3/4}H^2K^{3/4})$ regret bound for it. Here $d$ is the ambient dimension of the linear MDP, $H$ is the length of each episode, and $K$ is the number of episodes. Compared with existing policy-based algorithms, our algorithm achieves the state-of-the-art regret bound in both stochastic linear MDPs and adversarial linear MDPs with full-information feedback. Additionally, our algorithm design features a novel multi-batched updating mechanism, and the theoretical analysis uses a new covering number argument for value and policy classes, both of which might be of independent interest.
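For orientation, the regret in this setting can be written as follows. This is the standard definition from the adversarial MDP literature and is an assumption here, since the abstract does not state it. The symbols $\pi^k$ (the policy played in episode $k$), $x_1^k$ (the initial state of episode $k$), and $V_1^{\pi,k}$ (the value of policy $\pi$ under the reward function of episode $k$) are likewise notation introduced for illustration.

$$
\mathrm{Regret}(K) \;=\; \max_{\pi} \sum_{k=1}^{K} \Big( V_1^{\pi,k}(x_1^k) - V_1^{\pi^k,k}(x_1^k) \Big).
$$

Under this definition, dividing the stated bound by $K$ shows that the average per-episode regret vanishes as $K$ grows:

$$
\frac{\mathrm{Regret}(K)}{K} \;=\; \tilde{\mathcal{O}}\big(d^{3/4} H^{2} K^{-1/4}\big).
$$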