In stabilizing control tasks, model-free reinforcement learning (RL) struggles with both effectiveness and efficiency, particularly in complex, high-dimensional environments with limited training data. To address these challenges, we propose Multi-Step Actor-Critic Learning with Lyapunov Certificates (MSACL), a novel approach that integrates exponential stability into off-policy maximum entropy reinforcement learning (MERL). In contrast to existing RL-based approaches that depend on elaborate reward engineering and single-step constraints, MSACL adopts an intuitive reward design and exploits multi-step samples to enable exploratory actor-critic learning. Specifically, we first introduce Exponential Stability Labels (ESLs) to categorize training samples and propose a $\lambda$-weighted aggregation mechanism to learn Lyapunov certificates. Based on these certificates, we further design a stability-aware advantage function to guide policy optimization, thereby promoting rapid Lyapunov descent and robust state convergence. We evaluate MSACL on six benchmarks: four stabilization tasks and two high-dimensional tracking tasks. Experimental results show consistent performance improvements over both standard RL baselines and state-of-the-art Lyapunov-based RL algorithms. Beyond rapid convergence, MSACL is robust to environmental uncertainties and generalizes to unseen reference signals. The source code and benchmarking environments are available at \href{https://github.com/YuanZhe-Xing/MSACL}{https://github.com/YuanZhe-Xing/MSACL}.
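To make the idea of labeling multi-step samples and aggregating them with $\lambda$-weights more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a discrete-time exponential decay condition $V(x_{t+k}) \le (1-\alpha)^k V(x_t)$ as the labeling rule and TD($\lambda$)-style weights for the aggregation; the abstract specifies neither, and the function names and the parameters `decay` and `lam` are hypothetical.

```python
from typing import List


def exponential_stability_labels(v: List[float], decay: float) -> List[int]:
    """Label each k-step sample along one trajectory segment.

    ASSUMPTION: a sample is labeled 1 when the Lyapunov value after k steps
    satisfies V_k <= (1 - decay)^k * V_0, and 0 otherwise. The paper's actual
    labeling rule may differ.
    """
    v0 = v[0]
    return [1 if v[k] <= (1.0 - decay) ** k * v0 else 0 for k in range(1, len(v))]


def lambda_weights(n: int, lam: float) -> List[float]:
    """TD(lambda)-style weights over horizons 1..n (ASSUMPTION); they sum to 1."""
    weights = [(1.0 - lam) * lam ** (k - 1) for k in range(1, n)]
    weights.append(lam ** (n - 1))  # remaining mass goes to the longest horizon
    return weights


def aggregate_stability_score(labels: List[int], lam: float) -> float:
    """Combine the per-horizon labels into a single lambda-weighted score."""
    weights = lambda_weights(len(labels), lam)
    return sum(w * y for w, y in zip(weights, labels))


# Hypothetical Lyapunov values V(x_t), ..., V(x_{t+3}) along one segment.
values = [1.0, 0.8, 0.9, 0.5]
labels = exponential_stability_labels(values, decay=0.1)
score = aggregate_stability_score(labels, lam=0.5)
```

In this sketch the two-step sample violates the assumed decay bound (0.9 > 0.81), so it receives label 0 and lowers the aggregated score, while shorter horizons carry more weight for small `lam`.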