We introduce a novel policy learning method that integrates analytical gradients from differentiable environments with the Proximal Policy Optimization (PPO) algorithm. To incorporate analytical gradients into the PPO framework, we introduce the concept of an $\alpha$-policy, which serves as a locally superior policy. By adaptively adjusting the value of $\alpha$, we can control the influence of analytical policy gradients during learning. To this end, we propose metrics for assessing the variance and bias of analytical gradients, and we reduce dependence on these gradients when high variance or bias is detected. Our approach outperforms baseline algorithms in several settings, including function optimization, physics simulation, and traffic control environments. Our code is available at https://github.com/SonSang/gippo.