Reinforcement learning provides a mathematical framework for learning-based control, and its success largely depends on the amount of data it can utilize. Efficiently utilizing historical trajectories obtained under previous policies is therefore essential for expediting policy optimization. Empirical evidence has shown that policy gradient methods based on importance sampling work well in practice. However, the existing literature often neglects the interdependence among trajectories from different iterations, and the good empirical performance lacks a rigorous theoretical justification. In this paper, we study a variant of the natural policy gradient method that reuses historical trajectories via importance sampling. We show that the bias of the proposed gradient estimator is asymptotically negligible, that the resulting algorithm is convergent, and that reusing past trajectories improves the convergence rate. We further apply the proposed estimator to popular policy optimization algorithms such as trust region policy optimization. Our theoretical results are verified on classical benchmarks.
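To fix ideas, one common form of such an importance-sampling gradient estimator is sketched below; the notation is illustrative and not taken from the paper. Suppose $\tau_{i,j}$, $j=1,\dots,N$, are the trajectories collected in iteration $i$ under policy $\pi_{\theta_i}$, with trajectory density $p_{\theta_i}$ and cumulative reward $R(\tau)$. Reweighting all past trajectories toward the current parameter $\theta_k$ gives an estimator of the form
\[
\widehat{\nabla_\theta J}(\theta_k) \;=\; \frac{1}{k}\sum_{i=1}^{k} \frac{1}{N}\sum_{j=1}^{N} \frac{p_{\theta_k}(\tau_{i,j})}{p_{\theta_i}(\tau_{i,j})}\, \nabla_\theta \log p_{\theta_k}(\tau_{i,j})\, R(\tau_{i,j}),
\]
where the likelihood ratio corrects for the mismatch between the behavior policy $\pi_{\theta_i}$ and the target policy $\pi_{\theta_k}$ (a natural-gradient variant would additionally precondition this estimator with the inverse Fisher information matrix). Because each $\theta_i$ is itself computed from trajectories of earlier iterations, the reused samples are not independent, which is the interdependence issue this paper analyzes.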