Policy gradient algorithms are an important family of deep reinforcement learning techniques. Much past research has focused on using first-order policy gradient information to train policy networks. In contrast, the research in this paper is driven by the belief that properly utilizing and controlling Hessian information associated with the policy gradient can noticeably improve the performance of policy gradient algorithms. One key piece of Hessian information that attracted our attention is the Hessian trace, which gives the divergence of the policy gradient vector field in the Euclidean policy parametric space. We set the goal of generalizing this Euclidean policy parametric space into a general Riemannian manifold by introducing a metric tensor field $g_{ab}$ in the parametric space. This is achieved through newly developed mathematical tools, deep learning algorithms, and metric tensor deep neural networks (DNNs). Armed with these technical developments, we propose a new policy gradient algorithm that learns to minimize the absolute divergence in the Riemannian manifold as an important regularization mechanism, allowing the Riemannian manifold to smooth its policy gradient vector field. The newly developed algorithm is experimentally studied on several benchmark reinforcement learning problems. Our experiments clearly show that the new metric tensor regularized algorithm can significantly outperform its counterpart that does not use our regularization technique. Additional experimental analysis further suggests that the trained metric tensor DNN and the corresponding metric tensor $g_{ab}$ can effectively reduce the absolute divergence towards zero in the Riemannian manifold.
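To make the link between the Hessian trace and divergence concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes a hypothetical two-parameter toy objective $J(\theta) = -(\theta_0^2 + 3\theta_1^2) + \theta_0\theta_1$ standing in for the policy objective, and two hand-picked constant metrics in place of a learned metric tensor DNN. It uses the standard Riemannian divergence formula $\mathrm{div}_g V = \frac{1}{\sqrt{|g|}}\,\partial_a\!\left(\sqrt{|g|}\,V^a\right)$ with $V^a = g^{ab}\,\partial_b J$; all function names are illustrative.

```python
import math

def grad_J(theta):
    """Analytic gradient of the toy objective J = -(t0^2 + 3*t1^2) + t0*t1."""
    t0, t1 = theta
    return [-2.0 * t0 + t1, -6.0 * t1 + t0]

def identity_metric(theta):
    """Euclidean case: g_ab is the identity everywhere."""
    return [[1.0, 0.0], [0.0, 1.0]]

def scaled_metric(theta):
    """Hypothetical constant metric g_ab = 4 * identity (an assumption)."""
    return [[4.0, 0.0], [0.0, 4.0]]

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2(m):
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]

def weighted_field(theta, metric):
    """Return sqrt(det g) * V, where V^a = g^{ab} dJ/dtheta^b."""
    g = metric(theta)
    g_inv = inv2(g)
    dJ = grad_J(theta)
    root = math.sqrt(det2(g))
    return [root * sum(g_inv[a][b] * dJ[b] for b in range(2)) for a in range(2)]

def riemannian_divergence(theta, metric, h=1e-4):
    """Central-difference estimate of (1/sqrt|g|) * d_a (sqrt|g| * V^a)."""
    total = 0.0
    for a in range(2):
        up = list(theta); up[a] += h
        dn = list(theta); dn[a] -= h
        total += (weighted_field(up, metric)[a]
                  - weighted_field(dn, metric)[a]) / (2.0 * h)
    return total / math.sqrt(det2(metric(theta)))

theta = [0.3, -0.7]
div_euclid = riemannian_divergence(theta, identity_metric)  # Hessian trace of J
div_scaled = riemannian_divergence(theta, scaled_metric)
print(div_euclid, div_scaled)
```

With the identity metric the divergence equals the Hessian trace of the toy objective, $-2 - 6 = -8$. With the constant metric $4I$ it shrinks in magnitude to $-2$, which illustrates in the simplest possible setting how the choice of metric changes the absolute divergence.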