The policy gradient theorem gives a convenient form of the policy gradient in terms of three factors: an action value, a gradient of the action likelihood, and a state distribution involving discounting called the \emph{discounted stationary distribution}. However, commonly used on-policy methods based on the policy gradient theorem ignore the discount factor in the state distribution, which is technically incorrect and may even cause degenerate learning behavior in some environments. An existing solution corrects this discrepancy by using $\gamma^t$ as a factor in the gradient estimate. This solution is not widely adopted, though, and it does not work well in tasks where later states are similar to earlier states. We introduce a novel distribution correction that accounts for the discounted stationary distribution and can be plugged into many existing gradient estimators. Our correction avoids the performance degradation associated with the $\gamma^t$ correction and has lower variance. Importantly, compared to the uncorrected estimators, our algorithm provides improved state emphasis to evade suboptimal policies in certain environments, and it consistently matches or exceeds the original performance on several OpenAI Gym and DeepMind suite benchmarks.
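For concreteness, the three factors and the existing $\gamma^t$ correction can be written out as below. This is a sketch in standard notation that the abstract itself does not fix; the symbols are assumptions introduced for illustration: $\pi_\theta$ is the policy, $J(\theta)$ the expected discounted return from the initial state distribution, $Q^{\pi_\theta}$ the action value, $d_\gamma^{\pi_\theta}$ the discounted stationary distribution, and $\gamma \in [0,1)$ the discount factor.
\begin{align*}
\nabla_\theta J(\theta) &\propto \sum_{s} d_\gamma^{\pi_\theta}(s) \sum_{a} Q^{\pi_\theta}(s,a)\, \nabla_\theta \pi_\theta(a \mid s), \\
d_\gamma^{\pi_\theta}(s) &= (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t} \Pr(s_t = s \mid \pi_\theta).
\end{align*}
Under the same assumptions, a single sampled trajectory $(s_0, a_0, s_1, a_1, \dots)$ gives the $\gamma^t$-corrected estimate
\begin{align*}
\hat{g}_{\gamma^t} &= \sum_{t \ge 0} \gamma^{t}\, Q^{\pi_\theta}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t),
\end{align*}
whereas the commonly used uncorrected estimate drops the factor $\gamma^{t}$ and so weights every visited state equally. Because $\gamma^{t}$ shrinks geometrically, the corrected estimate gives late time steps very little weight, which is one way to read the weakness noted above for tasks whose later states resemble earlier ones. The correction proposed in this work is not reproduced here, since the abstract does not specify its form.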