This paper proposes a policy learning algorithm based on Koopman operator theory and the policy gradient approach. The algorithm simultaneously approximates an unknown dynamical system and searches for an optimal policy, using observations gathered through interaction with the environment. It has two innovations: first, it introduces the so-called deep Koopman representation into the policy gradient to obtain a linear approximation of the unknown dynamical system, with the aim of improving data efficiency; second, it applies Bellman's principle of optimality to avoid the errors that approximating the system dynamics would otherwise accumulate over long-horizon tasks. A theoretical analysis proves the asymptotic convergence of the proposed algorithm and characterizes the corresponding sampling complexity. These conclusions are also supported by simulations on several challenging benchmark environments.
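To illustrate what a linear approximation of a nonlinear system in a lifted space can look like, the following is a minimal sketch and not the paper's algorithm. It assumes a toy two-state system with no control input, hand-picked observables in place of a learned deep network, and a plain least-squares fit. The constants `LAM` and `MU` and the functions `step`, `lift`, and `fit_koopman` are hypothetical names introduced only for this example.

```python
import numpy as np

# Hypothetical constants for a toy system (not taken from the paper).
LAM, MU = 0.9, 0.5


def step(x):
    """One step of a nonlinear toy system that has an exact 3-dimensional lifting."""
    x1, x2 = x
    return np.array([LAM * x1, MU * x2 + (LAM**2 - MU) * x1**2])


def lift(x):
    """Hand-picked observables; a deep Koopman method would learn these instead."""
    x1, x2 = x
    return np.array([x1, x2, x1**2])


def fit_koopman(states, next_states):
    """Least-squares fit of a matrix A such that lift(x_next) ~= A @ lift(x)."""
    G = np.stack([lift(x) for x in states])            # shape (N, 3)
    G_next = np.stack([lift(x) for x in next_states])  # shape (N, 3)
    # Solve G @ K ~= G_next in the least-squares sense; then A = K.T.
    K, *_ = np.linalg.lstsq(G, G_next, rcond=None)
    return K.T


# Collect one-step transitions from random states, standing in for
# observations gathered through interaction with the environment.
rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, size=(200, 2))
next_states = np.array([step(x) for x in states])
A = fit_koopman(states, next_states)

# Roll the nonlinear system and the fitted linear model forward side by side.
x = np.array([0.8, -0.3])
z = lift(x)
for _ in range(10):
    x = step(x)
    z = A @ z  # linear prediction in the lifted space

# The first two lifted coordinates are the original state.
prediction_error = np.max(np.abs(z[:2] - x))
```

For this particular system the three observables span a subspace that the dynamics map into itself, so the fitted matrix reproduces the nonlinear trajectory up to numerical precision. For a general system no such finite lifting is known in advance, which is the gap a learned representation is meant to address.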