Deep reinforcement learning has successfully solved a wide variety of tasks; however, owing to the construction of the policy gradient and the resulting training dynamics, tuning deep reinforcement learning models remains challenging. As one of the most successful deep reinforcement learning algorithms, Proximal Policy Optimization (PPO) clips the policy gradient within conservative on-policy updates, which ensures reliable and stable policy improvement. However, this training pattern may sacrifice sample efficiency. Off-policy methods, on the other hand, make fuller use of data through sample reuse, though at the cost of increased estimation variance and bias. To leverage the advantages of both, we propose in this paper a new PPO variant that combines the stability guarantee of conservative on-policy iteration with more efficient off-policy data utilization. Specifically, we first derive an extended off-policy improvement objective from an expectation form of the generalized policy improvement lower bound. Second, we extend the clipping mechanism with segmented exponential functions to obtain a suitable surrogate objective function. Third, the trajectories generated by the past $M$ policies are organized in a replay buffer for off-policy training. We refer to this method as Extended Off-policy Proximal Policy Optimization (ExO-PPO). Compared with PPO and other state-of-the-art variants, empirical experiments on varied tasks demonstrate that ExO-PPO achieves improved performance with a balanced trade-off between sample efficiency and stability.
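As background for the clipping mechanism that ExO-PPO extends, the standard PPO clipped surrogate objective (from the original PPO formulation; the segmented exponential extension proposed here is not shown) can be written with probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ and advantage estimate $\hat{A}_t$ as:
$$
L^{\mathrm{CLIP}}(\theta) \;=\; \hat{\mathbb{E}}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\;\operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right],
$$
where the clipping range $\epsilon$ keeps each update conservative by removing the incentive to move $r_t(\theta)$ outside $[1-\epsilon,\,1+\epsilon]$.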