Reinforcement Learning (RL) has achieved remarkable success in solving complex sequential decision-making problems. However, its application to safety-critical physical systems remains constrained by the lack of stability guarantees. Standard RL algorithms prioritize reward maximization, often yielding policies that induce oscillations or unbounded state divergence. Significant work has incorporated Lyapunov-based stability guarantees into RL algorithms, but key challenges remain: selecting a candidate Lyapunov function, the computational complexity introduced by additional function approximators, and the overly conservative policies that can result from embedding stability criteria in the learning process. In this work we propose a novel Lyapunov-constrained Soft Actor-Critic (LC-SAC) algorithm based on Koopman operator theory. We use extended dynamic mode decomposition (EDMD) to produce a linear approximation of the system dynamics and derive a closed-form candidate Lyapunov function from this approximation. The derived Lyapunov function is incorporated into the SAC algorithm to guide learning toward a policy that stabilizes the nonlinear system. The approach is evaluated on trajectory tracking in a 2D quadrotor environment based on safe-control-gym. The proposed algorithm shows training convergence and decaying violations of the Lyapunov stability criterion compared to the baseline vanilla SAC algorithm. GitHub Repository: https://github.com/DhruvKushwaha/LC-SAC-Quadrotor-Trajectory-Tracking
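The abstract does not give implementation details, but the core construction it describes — an EDMD-fitted linear lifted model, then a closed-form candidate Lyapunov function from the discrete Lyapunov equation — can be sketched on a toy system. Everything below (the example dynamics, the observable dictionary, all variable names) is an illustrative assumption, not the paper's actual setup:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Toy nonlinear system (a classic Koopman example), stable at the origin:
#   x1+ = 0.9*x1,   x2+ = 0.8*x2 + 0.2*x1**2
def step(x):
    return np.array([0.9 * x[0], 0.8 * x[1] + 0.2 * x[0] ** 2])

# Dictionary of observables z(x); including x1**2 makes the lifted
# dynamics of this particular system exactly linear.
def lift(x):
    return np.array([x[0], x[1], x[0] ** 2])

# Collect snapshot pairs (z_k, z_{k+1}) from random initial states
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 2))
Z = np.array([lift(x) for x in X])           # z_k,   shape (500, 3)
Zp = np.array([lift(step(x)) for x in X])    # z_{k+1}

# EDMD: least-squares fit of A such that z_{k+1} ~ A z_k
A = np.linalg.lstsq(Z, Zp, rcond=None)[0].T

# Closed-form candidate Lyapunov function V(x) = z(x)^T P z(x),
# where P solves the discrete Lyapunov equation  A^T P A - P = -Q.
Q = np.eye(3)
P = solve_discrete_lyapunov(A.T, Q)

def V(x):
    z = lift(x)
    return z @ P @ z

# Sanity check: V decreases along a trajectory of the nonlinear system
x = np.array([0.8, -0.5])
for _ in range(5):
    xn = step(x)
    assert V(xn) < V(x)
    x = xn
```

Since the lifted dynamics here are exactly linear and the spectral radius of `A` is below one, `P` is symmetric positive definite and `V` is a valid discrete-time Lyapunov function; in LC-SAC such a `V` would then enter the SAC update as a stability constraint.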