This paper introduces the Hamilton-Jacobi-Bellman Proximal Policy Optimization (HJBPPO) algorithm for reinforcement learning. The Hamilton-Jacobi-Bellman (HJB) equation is used in control theory to assess the optimality of a value function. Our work combines the HJB equation with reinforcement learning in continuous state and action spaces to improve the training of the value network. We treat the value network as a Physics-Informed Neural Network (PINN) that solves the HJB equation, computing the derivatives of the network output with respect to its inputs exactly. The Proximal Policy Optimization (PPO)-Clipped algorithm is improved with this implementation, since it uses a value network to compute the objective function for its policy network. The HJBPPO algorithm shows improved performance compared to PPO on the MuJoCo environments.
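To make the PINN idea concrete, the following is a minimal sketch of an HJB residual for a small value network. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not state the equation, so we assume the standard discounted continuous-time form $r(x,u) + \nabla V(x) \cdot f(x,u) + \ln(\gamma)\, V(x) = 0$ along the chosen action, and we assume the unknown dynamics $f$ are approximated by the finite difference $(x_{t+1} - x_t)/\Delta t$. The helper names `init_value_net`, `value_and_grad`, and `hjb_residual` are hypothetical. A one-hidden-layer tanh network is used so that the input gradient can be written in closed form, which is what automatic differentiation would return for a larger network.

```python
import math
import random


def init_value_net(state_dim, hidden, seed=0):
    """Randomly initialise a one-hidden-layer tanh value network (illustrative)."""
    rng = random.Random(seed)
    W1 = [[rng.gauss(0.0, 0.5) for _ in range(state_dim)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.gauss(0.0, 0.5) for _ in range(hidden)]
    b2 = 0.0
    return W1, b1, w2, b2


def value_and_grad(params, x):
    """Return V(x) and the exact gradient dV/dx of the network at state x."""
    W1, b1, w2, b2 = params
    # Hidden activations h_j = tanh(W1[j] . x + b1[j])
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    v = sum(w * hj for w, hj in zip(w2, h)) + b2
    # Chain rule: dV/dx_i = sum_j w2[j] * (1 - h_j^2) * W1[j][i]
    grad = [sum(w2[j] * (1.0 - h[j] ** 2) * W1[j][i] for j in range(len(h)))
            for i in range(len(x))]
    return v, grad


def hjb_residual(params, x, x_next, reward, dt, gamma):
    """Residual of the ASSUMED HJB form: r + grad V . x_dot + ln(gamma) * V."""
    v, grad = value_and_grad(params, x)
    # Assumption: dynamics approximated by a finite difference of consecutive states.
    x_dot = [(xn - xi) / dt for xn, xi in zip(x_next, x)]
    return reward + sum(g * d for g, d in zip(grad, x_dot)) + math.log(gamma) * v


params = init_value_net(state_dim=3, hidden=8)
res = hjb_residual(params, x=[0.1, -0.2, 0.3], x_next=[0.12, -0.19, 0.28],
                   reward=1.0, dt=0.01, gamma=0.99)
print(res)
```

One plausible use, which the abstract does not spell out, is to add the mean squared residual over a batch of transitions as an extra term in the value-network loss, alongside the usual PPO value target.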