Traditional reinforcement learning provides no stability guarantees. More recent algorithms learn Lyapunov functions alongside the control policies to ensure stable learning. However, current self-learned Lyapunov functions are sample inefficient due to their on-policy nature. This paper introduces a method for learning Lyapunov functions off-policy and incorporates the proposed off-policy Lyapunov function into the Soft Actor-Critic and Proximal Policy Optimization algorithms, providing them with a data-efficient stability certificate. Simulations of an inverted pendulum and a quadrotor illustrate the improved performance of the two algorithms when endowed with the proposed off-policy Lyapunov function.
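To make the off-policy idea concrete, the sketch below is a minimal illustration (not the paper's actual construction): a Lyapunov candidate must vanish at the equilibrium, be positive elsewhere, and decrease along closed-loop transitions. Because the decrease condition is checked on stored state-transition pairs, it can be evaluated on replay-buffer data rather than fresh on-policy rollouts. The quadratic candidate, the linear test system, and the hinge penalty are all hypothetical choices for illustration.

```python
import numpy as np

def lyapunov_candidate(state, P):
    """Quadratic candidate L(s) = s^T P s with P positive definite
    (illustrative choice; a learned network would replace this)."""
    return state @ P @ state

def decrease_violation(states, next_states, P, alpha=0.1):
    """Average hinge penalty on the Lyapunov decrease condition
    L(s') <= (1 - alpha) * L(s), evaluated on arbitrary stored
    (s, s') transitions -- the off-policy aspect: no fresh rollouts
    under the current policy are required to compute this check."""
    L = np.array([lyapunov_candidate(s, P) for s in states])
    L_next = np.array([lyapunov_candidate(s, P) for s in next_states])
    return np.maximum(L_next - (1.0 - alpha) * L, 0.0).mean()

# Hypothetical stable linear system s' = A s (spectral radius < 1),
# standing in for a closed-loop policy-plus-dynamics pair.
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
P = np.eye(2)
rng = np.random.default_rng(0)
states = rng.normal(size=(64, 2))      # batch of stored states
next_states = states @ A.T             # their successor states
penalty = decrease_violation(states, next_states, P, alpha=0.05)
```

For this stable system the decrease condition holds on every sampled transition, so the penalty is zero; in a learning loop such a penalty would be added to the actor or critic loss to steer the policy toward certifiable stability.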