Policy gradient methods, which have been extensively studied over the last decade, offer an effective and efficient framework for reinforcement learning. However, their performance can often be unsatisfactory, suffering from unreliable reward improvements and slow convergence due to high variance in gradient estimates. In this paper, we propose a universal reward profiling framework that can be seamlessly integrated with any policy gradient algorithm, in which we selectively update the policy based on high-confidence performance estimates. We theoretically justify that our technique does not slow down the convergence of the baseline policy gradient methods and, with high probability, yields stable and monotonic improvements in their performance. Empirically, on eight continuous-control benchmarks (Box2D and MuJoCo/PyBullet), our profiling yields up to 1.5x faster convergence to near-optimal returns and up to a 1.75x reduction in return variance on some setups. Our profiling approach offers a general, theoretically grounded path to more reliable and efficient policy learning in complex environments.
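To make the core idea concrete, the following is a minimal, hypothetical sketch of confidence-gated updating: a candidate policy update is accepted only when sampled returns indicate improvement with high confidence. The acceptance rule below (a one-sided mean-comparison using a standard-error margin) is an illustrative assumption, not the paper's actual profiling criterion, and the function name `confident_improvement` is invented for this example.

```python
import math

def confident_improvement(returns_new, returns_old, z=1.645):
    """Accept a policy update only if the candidate policy's mean return
    exceeds the current policy's mean by a margin covering the standard
    error of the estimate (one-sided, roughly 95% confidence)."""
    n = len(returns_new)
    mean_new = sum(returns_new) / n
    mean_old = sum(returns_old) / len(returns_old)
    # Sample variance of the candidate's returns.
    var_new = sum((r - mean_new) ** 2 for r in returns_new) / max(n - 1, 1)
    margin = z * math.sqrt(var_new / n)
    return mean_new - margin > mean_old

# Toy return samples (illustrative data, not benchmark results).
old = [1.0, 0.8, 1.2, 0.9, 1.1]          # current policy, mean 1.0
better = [2.0, 1.9, 2.1, 2.0, 2.0]       # clear, low-variance gain
noisy = [4.0, -2.0, 3.0, -1.5, 2.0]      # similar mean, high variance

print(confident_improvement(better, old))  # True: accept the update
print(confident_improvement(noisy, old))   # False: keep the current policy
```

Gating updates this way trades a few rejected (but possibly good) updates for protection against high-variance gradient noise, which is the mechanism behind the stability claims above.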