Deep Policy Gradient (PG) algorithms employ value networks to drive the learning of parameterized policies and to reduce the variance of the gradient estimates. However, value function approximation gets stuck in local optima and struggles to fit the actual return, which limits the efficacy of variance reduction and leads policies to sub-optimal performance. This paper focuses on improving value approximation and analyzing its effects on Deep PG primitives such as value prediction, variance reduction, and the correlation of gradient estimates with the true gradient. To this end, we introduce Value Function Search, which employs a population of perturbed value networks to search for a better approximation. Our framework does not require additional environment interactions, gradient computations, or ensembles, providing a computationally inexpensive approach to enhance the supervised learning task on which value networks train. Crucially, we show on common continuous control benchmark domains that improving Deep PG primitives results in improved sample efficiency and policies with higher returns.
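As a rough illustration of the search idea described in the abstract, the sketch below perturbs the parameters of a value function with Gaussian noise, scores each candidate by mean squared error against already-collected returns, and keeps the best one. It is not the authors' implementation: the linear value function, the helper names `mse` and `value_function_search`, the population size, the noise scale, and the synthetic data are all assumptions made for the example. It is only meant to show that such a search needs forward evaluations alone, with no gradients and no new environment interactions.

```python
import random


def mse(weights, states, returns):
    """Mean squared error of a linear value function V(s) = sum_i w_i * s_i."""
    total = 0.0
    for state, target in zip(states, returns):
        prediction = sum(w * x for w, x in zip(weights, state))
        total += (prediction - target) ** 2
    return total / len(states)


def value_function_search(weights, states, returns, rng,
                          population_size=10, noise_scale=0.05):
    """Return the best of the current weights and a perturbed population.

    Hypothetical helper: every candidate is the current weight vector plus
    Gaussian noise, and candidates are compared on stored data only.
    """
    best_weights = list(weights)
    best_error = mse(best_weights, states, returns)
    for _ in range(population_size):
        candidate = [w + rng.gauss(0.0, noise_scale) for w in weights]
        error = mse(candidate, states, returns)
        if error < best_error:  # keep a candidate only if it fits better
            best_weights, best_error = candidate, error
    return best_weights, best_error


# Synthetic stand-in for a buffer of (state, return) pairs (assumed data).
rng = random.Random(0)
true_weights = [1.0, -2.0, 0.5, 3.0]
states = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(128)]
returns = [sum(w * x for w, x in zip(true_weights, s)) + rng.gauss(0.0, 0.1)
           for s in states]

current_weights = [0.0, 0.0, 0.0, 0.0]
initial_error = mse(current_weights, states, returns)
current_error = initial_error
for _ in range(50):  # repeated search rounds over the same stored data
    current_weights, current_error = value_function_search(
        current_weights, states, returns, rng)

print(f"MSE before search: {initial_error:.3f}, after: {current_error:.3f}")
```

Because a candidate replaces the current parameters only when its error is strictly lower, the fit on the stored data can never get worse from one round to the next. In a Deep PG setting the same loop would presumably run between the usual gradient-based value updates rather than in place of them.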