Reinforcement Learning (RL) has achieved remarkable success in solving complex sequential decision-making problems. However, its application to safety-critical physical systems remains constrained by the lack of stability guarantees. Standard RL algorithms prioritize reward maximization, often yielding policies that induce oscillations or unbounded state divergence. There has been significant work on incorporating Lyapunov-based stability guarantees into RL algorithms; the key challenges are selecting a candidate Lyapunov function, the computational complexity introduced by additional function approximators, and the overly conservative policies that can result from embedding a stability criterion in the learning process. In this work we propose a novel Lyapunov-constrained Soft Actor-Critic (LC-SAC) algorithm based on Koopman operator theory. We use extended dynamic mode decomposition (EDMD) to produce a linear approximation of the system, and from this approximation we derive a closed-form candidate Lyapunov function. This Lyapunov function is incorporated into the SAC algorithm to provide stability guarantees for the learned policy on the nonlinear system. The approach is evaluated on trajectory tracking in a 2D quadrotor environment based on safe-control-gym. The proposed algorithm shows training convergence and decaying violations of the Lyapunov stability criterion compared to a baseline vanilla SAC algorithm. GitHub Repository: https://github.com/DhruvKushwaha/LC-SAC-Quadrotor-Trajectory-Tracking
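The EDMD-to-Lyapunov pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dictionary of observables `lift`, the 2D toy linear system standing in for collected rollouts, and the identity choice of `Q` are all assumptions made for the example. EDMD fits a Koopman matrix by least squares on lifted snapshot pairs, and the closed-form Lyapunov candidate comes from solving the discrete Lyapunov equation for the lifted linear model.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Hypothetical dictionary of observables (the "lifting"): monomials up to degree 2.
def lift(x):
    x1, x2 = x
    return np.array([x1, x2, x1**2, x1 * x2, x2**2])

# Toy stable linear system x+ = A_true x, standing in for environment rollouts.
rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.1],
                   [0.0, 0.8]])
X = rng.standard_normal((200, 2))        # sampled states
Y = X @ A_true.T                          # successor states

# EDMD: least-squares Koopman fit on lifted snapshots, Psi_X @ K ~= Psi_Y,
# so psi(x+) ~= K.T @ psi(x) gives the linear predictor A in lifted space.
Psi_X = np.array([lift(x) for x in X])
Psi_Y = np.array([lift(y) for y in Y])
K, *_ = np.linalg.lstsq(Psi_X, Psi_Y, rcond=None)
A = K.T

# Closed-form Lyapunov candidate: solve A^T P A - P = -Q for the lifted model.
Q = np.eye(A.shape[0])
P = solve_discrete_lyapunov(A.T, Q)

# V(x) = psi(x)^T P psi(x); the stability criterion used during learning would
# check that V decreases along observed transitions, V(x+) - V(x) < 0.
def V(x):
    z = lift(x)
    return float(z @ P @ z)
```

Because the quadratic dictionary is closed under linear dynamics, the fitted lifted model is near-exact here and `V` decreases along trajectories; for the quadrotor, the dictionary choice and the quality of the EDMD fit determine how faithful the candidate Lyapunov function is.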