Revisiting Estimation Bias in Policy Gradients for Deep Reinforcement Learning

We revisit the estimation bias in policy gradients for the discounted episodic Markov decision process (MDP) from a deep reinforcement learning (DRL) perspective. In theory, the objective is the expected return discounted over the time horizon. A major source of policy gradient bias is the state distribution shift: the state distribution used to estimate the gradients differs from the theoretical one because it does not take the discount factor into account. In the literature, discussion of this bias's influence has been limited to the tabular and softmax cases. In this paper, we extend the analysis to the DRL setting, where the policy is parameterized, and demonstrate theoretically how this bias can lead to suboptimal policies. We then discuss why inaccurate practical implementations that use the shifted state distribution can still be effective. We show that, despite the state distribution shift, the policy gradient estimation bias can be reduced in three ways: 1) a small learning rate; 2) an adaptive-learning-rate-based optimizer; and 3) KL regularization. Specifically, we show that a smaller learning rate, or an adaptive learning rate such as that used by the Adam and RMSProp optimizers, makes policy optimization robust to the bias. We further draw connections between optimizers and optimization regularization to show that both KL and reverse KL regularization can significantly rectify this bias. Moreover, we provide extensive experiments on continuous control tasks to support our analysis. Our paper sheds light on how successful policy gradient (PG) algorithms optimize policies in the DRL setting and contributes insights into practical issues in DRL.
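To make the state distribution shift concrete, the following is a minimal sketch, not the paper's implementation. It assumes a REINFORCE-style estimator in which each term grad log pi(a_t | s_t) is multiplied by a scalar weight. Under the discounted objective, that weight is gamma^t times the discounted return-to-go; many practical implementations drop the gamma^t factor, which amounts to sampling states from the undiscounted distribution. The function names `returns_to_go` and `pg_weights` and the flag `discount_states` are hypothetical, chosen only for illustration.

```python
import numpy as np


def returns_to_go(rewards, gamma):
    """Discounted return-to-go: G_t = sum_{k >= t} gamma^(k - t) * r_k."""
    g = np.zeros(len(rewards))
    running = 0.0
    # Accumulate backwards so each step reuses the next step's return.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        g[t] = running
    return g


def pg_weights(rewards, gamma, discount_states=True):
    """Per-timestep weights on grad log pi(a_t | s_t) for one episode.

    discount_states=True  -> gamma^t * G_t (matches the discounted objective)
    discount_states=False -> G_t only (the common, shifted variant)
    """
    g = returns_to_go(rewards, gamma)
    if discount_states:
        return gamma ** np.arange(len(rewards)) * g
    return g


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    print("with gamma^t:   ", pg_weights(rewards, 0.5, discount_states=True))
    print("without gamma^t:", pg_weights(rewards, 0.5, discount_states=False))
```

The two weightings agree at t = 0 and diverge more the later a state is visited, so the shifted variant gives late states more influence on the gradient than the discounted objective assigns them.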
