This paper presents a $\delta$-PI algorithm, based on the damped Newton method, for the $H_\infty$ tracking control problem of unknown continuous-time nonlinear systems. A discounted performance function and an augmented system are used to obtain the tracking Hamilton-Jacobi-Isaacs (HJI) equation. The tracking HJI equation is a nonlinear partial differential equation, and traditional reinforcement learning methods for solving it are mostly based on Newton's method, which usually guarantees only local convergence and requires a good initial guess. Based on the damped Newton iteration operator equation, a generalized tracking Bellman equation is first derived. The $\delta$-PI algorithm seeks the optimal solution of the tracking HJI equation by iteratively solving this generalized tracking Bellman equation. Both on-policy and off-policy $\delta$-PI reinforcement learning methods are provided. The off-policy $\delta$-PI algorithm is model-free: it can be performed without a priori knowledge of the system dynamics. A neural network (NN)-based implementation scheme for the off-policy $\delta$-PI algorithm is presented. The suitability of the model-free $\delta$-PI algorithm is illustrated with a nonlinear system simulation.
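To make the role of damping concrete, here is a minimal Python sketch of a generic damped Newton iteration $x_{k+1} = x_k - \delta f(x_k)/f'(x_k)$ applied to the scalar equation $\arctan(x) = 0$. It is not the $\delta$-PI algorithm itself: the function `damped_newton`, the test equation, the fixed damping factor and the starting point $x_0 = 2$ are illustrative assumptions. In $\delta$-PI the unknown is a value function and $f$ is replaced by the tracking HJI operator; we assume, without confirmation from the abstract, that $\delta = 1$ corresponds to the undamped (standard PI) case.

```python
import math
from typing import Callable, List


def damped_newton(
    f: Callable[[float], float],
    df: Callable[[float], float],
    x0: float,
    delta: float = 1.0,
    max_iter: int = 60,
    tol: float = 1e-12,
    blowup: float = 1e12,
) -> List[float]:
    """Iterate x_{k+1} = x_k - delta * f(x_k) / df(x_k).

    delta = 1.0 gives the plain Newton method; 0 < delta < 1 shortens
    every Newton step by the fixed factor delta (a hypothetical choice
    for illustration). Returns all iterates, including x0. Stops early
    when |f(x)| < tol or when the iterate exceeds `blowup` in magnitude.
    """
    xs = [x0]
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:  # converged: residual is small enough
            break
        x = x - delta * fx / df(x)  # (damped) Newton step
        xs.append(x)
        if not math.isfinite(x) or abs(x) > blowup:  # diverging
            break
    return xs


if __name__ == "__main__":
    f = math.atan
    df = lambda x: 1.0 / (1.0 + x * x)  # derivative of arctan
    plain = damped_newton(f, df, x0=2.0, delta=1.0)
    damped = damped_newton(f, df, x0=2.0, delta=0.5)
    print("plain Newton :", [f"{x:.3g}" for x in plain])
    print("damped Newton:", f"{damped[-1]:.2e} after {len(damped) - 1} steps")
```

From $x_0 = 2$ the plain Newton iterates alternate in sign with rapidly growing magnitude until the blow-up guard stops them, while the damped iteration with $\delta = 0.5$ settles at the root $x = 0$. A fixed damping factor only enlarges the region of convergence in this particular example; it is not by itself a general global-convergence guarantee.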