Policy Optimization (PO) methods are among the most popular Reinforcement Learning (RL) algorithms in practice. Recently, Sherman et al. [2023a] proposed a PO-based algorithm with rate-optimal regret guarantees under the linear Markov Decision Process (MDP) model. However, their algorithm relies on a costly pure-exploration warm-up phase that is hard to implement in practice. This paper eliminates that undesired warm-up phase, replacing it with a simple and efficient contraction mechanism. Our PO algorithm achieves rate-optimal regret with improved dependence on the other parameters of the problem (horizon and function-approximation dimension) in two fundamental settings: adversarial losses with full-information feedback and stochastic losses with bandit feedback.