In safety-critical applications, model-free reinforcement learning (RL) struggles to provide verifiable stability guarantees while maintaining high exploration efficiency. To address this, we present Multi-Step Actor-Critic Learning with Lyapunov Certificates (MSACL), an approach that integrates exponential stability with maximum entropy reinforcement learning (MERL). In contrast to existing methods that rely on complex reward engineering and single-step constraints, MSACL uses intuitive rewards and multi-step data for actor-critic learning. Specifically, we first introduce Exponential Stability Labels (ESLs) to categorize samples and propose a $\lambda$-weighted aggregation mechanism to learn Lyapunov certificates. Leveraging these certificates, we then develop a stability-aware advantage function to guide policy optimization, promoting rapid Lyapunov descent and robust state convergence. We evaluate MSACL on six benchmarks, comprising four stabilization tasks and two high-dimensional tracking tasks. Experimental results show that it consistently outperforms both standard RL baselines and state-of-the-art Lyapunov-based RL algorithms. Beyond rapid convergence, MSACL is robust to environmental uncertainties and generalizes to unseen reference signals. The source code and benchmarking environments are available at \href{https://github.com/YuanZhe-Xing/MSACL}{https://github.com/YuanZhe-Xing/MSACL}.
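The abstract names ESLs and a $\lambda$-weighted aggregation over multi-step data but does not define them. The sketch below is one possible reading, not the definition used in MSACL. It assumes that a $k$-step sample counts as "stable" when $V(x_{t+k}) \le (1-\alpha)^k V(x_t)$, and that per-step violations are combined with normalized geometric weights $\lambda^{k-1}$. The function names, the decay rate `alpha`, and the hinge penalty are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def exponential_stability_labels(v: Sequence[float], alpha: float) -> List[int]:
    """Assumed labeling rule: 1 if V(x_{t+k}) <= (1 - alpha)^k * V(x_t), else 0.

    v[0] is V(x_t); v[k] is V(x_{t+k}) along one sampled trajectory segment.
    """
    v0 = v[0]
    return [1 if v[k] <= (1.0 - alpha) ** k * v0 else 0 for k in range(1, len(v))]


def lambda_weights(n: int, lam: float) -> List[float]:
    """Normalized geometric weights lam^(k-1) for k = 1..n (assumed scheme)."""
    raw = [lam ** (k - 1) for k in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def aggregated_violation(v: Sequence[float], alpha: float, lam: float) -> float:
    """Lambda-weighted sum of hinge violations max(0, V_k - (1 - alpha)^k * V_0)."""
    n = len(v) - 1
    weights = lambda_weights(n, lam)
    v0 = v[0]
    return sum(
        weights[k - 1] * max(0.0, v[k] - (1.0 - alpha) ** k * v0)
        for k in range(1, n + 1)
    )


if __name__ == "__main__":
    # Hypothetical Lyapunov values along a 3-step segment.
    values = [1.0, 0.7, 0.7, 0.4]
    print(exponential_stability_labels(values, alpha=0.2))
    print(aggregated_violation(values, alpha=0.2, lam=0.5))
```

Under these assumptions, a segment whose Lyapunov values decay at least as fast as $(1-\alpha)^k$ incurs zero penalty, and a smaller `lam` places more of the weight on short-horizon steps.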