Towards Provable Log Density Policy Gradient

Policy gradient methods are a vital ingredient behind the success of modern reinforcement learning. Modern policy gradient methods, although successful, introduce a residual error in gradient estimation. In this work, we argue that this residual term is significant and that correcting for it could improve the sample complexity of reinforcement learning methods. To that end, we propose the log density gradient to estimate the policy gradient, which corrects for this residual error term. The log density gradient method computes the policy gradient by utilizing the state-action discounted distributional formulation. We first present the equations needed to exactly find the log density gradient for tabular Markov decision processes (MDPs). For more complex environments, we propose a temporal difference (TD) method that approximates the log density gradient by utilizing backward on-policy samples. Since backward sampling from a Markov chain is highly restrictive, we also propose a min-max optimization that can approximate the log density gradient using just on-policy samples. We prove uniqueness, and convergence under linear function approximation, for this min-max optimization. Finally, we show that the sample complexity of our min-max optimization is of the order of $m^{-1/2}$, where $m$ is the number of on-policy samples. We also demonstrate a proof of concept for our log density gradient method on a gridworld environment, and observe that our method improves upon the classical policy gradient method by a clear margin, indicating a promising direction for developing reinforcement learning algorithms that require fewer samples.
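
To make the tabular case concrete, the following is a minimal sketch and not the paper's implementation. It assumes a softmax policy with one logit per state-action pair, a normalized discounted state-action distribution $d$ (summing to one), and the identity $\nabla J = (1-\gamma)^{-1}\,\mathbb{E}_{(s,a)\sim d}[\nabla \log d(s,a)\, r(s,a)]$ that follows from $J = (1-\gamma)^{-1}\sum_{s,a} d(s,a)\, r(s,a)$. All function and variable names are illustrative.

```python
import numpy as np

def softmax_policy(theta):
    # theta has shape (S, A); each row holds the logits for one state.
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def discounted_occupancy(theta, P, mu0, gamma):
    # P[s, a, s2] is the transition probability, mu0[s] the initial distribution.
    S, A = theta.shape
    pi = softmax_policy(theta)
    # State-action transition matrix: P_pi[(s,a),(s2,a2)] = P[s,a,s2] * pi[s2,a2].
    P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
    init = (mu0[:, None] * pi).reshape(S * A)
    # d solves d = (1 - gamma) * init + gamma * P_pi^T d.
    M = np.eye(S * A) - gamma * P_pi.T
    d = (1.0 - gamma) * np.linalg.solve(M, init)
    return d, M, pi

def expected_return(theta, P, r, mu0, gamma):
    d, _, _ = discounted_occupancy(theta, P, mu0, gamma)
    return d @ r.reshape(-1) / (1.0 - gamma)

def log_density_policy_gradient(theta, P, r, mu0, gamma):
    S, A = theta.shape
    d, M, pi = discounted_occupancy(theta, P, mu0, gamma)
    # G[(s,a), (s2,b)] = d log pi(a|s) / d theta[s2,b] for the softmax policy.
    G = np.zeros((S * A, S * A))
    for s in range(S):
        for a in range(A):
            row = s * A + a
            G[row, s * A:(s + 1) * A] = -pi[s]
            G[row, row] += 1.0
    # Differentiating the fixed-point equation for d gives
    # grad_d = diag(d) G + gamma * P_pi^T grad_d, another linear system.
    grad_d = np.linalg.solve(M, d[:, None] * G)
    # Assumes d > 0 everywhere (e.g. mu0 has full support).
    grad_log_d = grad_d / d[:, None]
    weighted = d[:, None] * grad_log_d * r.reshape(-1, 1)
    grad_J = weighted.sum(axis=0) / (1.0 - gamma)
    return grad_J.reshape(S, A), grad_log_d
```

On a small random MDP, the gradient returned by `log_density_policy_gradient` can be checked against central finite differences of `expected_return`, which is a useful sanity check before moving to sampled estimators.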
