In this paper, we design an off-policy reinforcement learning algorithm that solves the continuous-time LQR problem using only input-state data measured from the system. In contrast to other algorithms in the literature, our approach uses a specific persistently exciting input as the exploration signal during the data collection step. We then show that, with this persistently excited data, the solution of the matrix equation in our algorithm is guaranteed to exist and to be unique at every iteration. We also prove that the algorithm converges to the optimal control input. Moreover, we formulate the policy evaluation step as the solution of a Sylvester-transpose equation, which makes this step more efficient to solve. Finally, we propose a method that uses only measured data to determine a stabilizing policy with which to initialize the algorithm.
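To make the policy evaluation and policy improvement steps concrete, the following is a minimal sketch of classical model-based policy iteration for continuous-time LQR (often attributed to Kleinman). It is not the data-driven algorithm of this paper: it assumes the system matrices `A` and `B` are known, and it solves a standard Lyapunov equation rather than the Sylvester-transpose equation mentioned above. The example system, the weights `Q` and `R`, the initial gain `K0`, and all function names are illustrative assumptions.

```python
import numpy as np


def solve_lyapunov(A_cl, M):
    """Solve A_cl^T P + P A_cl + M = 0 for P using Kronecker vectorization."""
    n = A_cl.shape[0]
    I = np.eye(n)
    # Linear operator acting on the flattened P (valid for either flattening order).
    L = np.kron(I, A_cl.T) + np.kron(A_cl.T, I)
    P = np.linalg.solve(L, -M.reshape(-1)).reshape(n, n)
    return 0.5 * (P + P.T)  # symmetrize against round-off


def policy_iteration(A, B, Q, R, K0, iters=20, tol=1e-10):
    """Model-based policy iteration; K0 must stabilize A - B K0 (assumption)."""
    K = K0
    P_prev = None
    for _ in range(iters):
        A_cl = A - B @ K
        # Policy evaluation: cost matrix P of the current gain K.
        P = solve_lyapunov(A_cl, Q + K.T @ R @ K)
        # Policy improvement: K = R^{-1} B^T P.
        K = np.linalg.solve(R, B.T @ P)
        if P_prev is not None and np.linalg.norm(P - P_prev) < tol:
            break
        P_prev = P
    return P, K


def riccati_residual(A, B, Q, R, P):
    """Left-hand side of the algebraic Riccati equation; zero at the optimum."""
    return A.T @ P + P @ A - P @ B @ np.linalg.solve(R, B.T @ P) + Q


# Illustrative open-loop stable system, so K0 = 0 is a stabilizing initial policy.
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
K0 = np.zeros((1, 2))

P, K = policy_iteration(A, B, Q, R, K0)
print("Riccati residual norm:", np.linalg.norm(riccati_residual(A, B, Q, R, P)))
```

In this sketch the stabilizing initial gain is available trivially because the example system is open-loop stable. A data-driven method would have to replace both the Lyapunov solve and the choice of `K0` with computations on measured input-state data.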