Clipped-Objective Policy Gradients for Pessimistic Policy Optimization

To facilitate efficient learning, policy gradient approaches to deep reinforcement learning (RL) are typically paired with variance reduction measures and strategies for making large but safe policy changes based on a batch of experiences. Natural policy gradient methods, including Trust Region Policy Optimization (TRPO), seek to produce monotonic improvement through bounded changes in policy outputs. Proximal Policy Optimization (PPO) is a commonly used first-order algorithm that instead uses loss clipping to take multiple safe optimization steps per batch of data, replacing the bound on the single step of TRPO with regularization on multiple steps. In this work, we find that the performance of PPO, when applied to continuous action spaces, may be consistently improved through a simple change in objective. In place of the importance sampling objective of PPO, we recommend a basic policy gradient, clipped in an equivalent fashion. While both objectives produce biased gradient estimates with respect to the RL objective, both also display significantly reduced variance compared to the unbiased off-policy policy gradient. Additionally, we show that (1) the clipped-objective policy gradient (COPG) objective is on average "pessimistic" compared to the PPO objective and (2) this pessimism promotes enhanced exploration. As a result, we empirically observe that COPG produces improved learning compared to PPO in single-task, constrained, and multi-task learning, without adding significant computational cost or complexity. Compared to TRPO, the COPG approach is seen to offer comparable or superior performance, while retaining the simplicity of a first-order method.
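To make the change of objective concrete, the following is a minimal per-sample sketch in plain Python. The PPO function follows the standard clipped surrogate. The COPG function is an assumption: it reads "a basic policy gradient, clipped in an equivalent fashion" as replacing the probability ratio with the log-probability and clipping it at the log-space images of the same ratio bounds. The function names, the `eps=0.2` default, and the helper `is_clipped` are illustrative and not taken from the text.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Standard PPO clipped surrogate for one sample:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A), with r = pi_new / pi_old."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def copg_objective(logp_new, logp_old, advantage, eps=0.2):
    """Assumed COPG form for one sample: log pi_new replaces the ratio,
    and is clipped to [log((1 - eps) * pi_old), log((1 + eps) * pi_old)]."""
    lower = logp_old + math.log(1.0 - eps)
    upper = logp_old + math.log(1.0 + eps)
    clipped_logp = min(max(logp_new, lower), upper)
    return min(logp_new * advantage, clipped_logp * advantage)


def is_clipped(logp_new, logp_old, advantage, eps=0.2):
    """True when the sample lies in the flat (zero-gradient) region.
    Under the assumed forms above, the condition is the same for both objectives."""
    log_ratio = logp_new - logp_old
    too_high = advantage > 0 and log_ratio > math.log(1.0 + eps)
    too_low = advantage < 0 and log_ratio < math.log(1.0 - eps)
    return too_high or too_low
```

Under these assumptions the two objectives stop contributing gradient on exactly the same samples; they differ only in the unclipped region, where the PPO gradient is scaled by the ratio and the COPG gradient is the plain policy gradient term.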
