Current model-free reinforcement learning (RL) algorithms commonly use sampling-based stability criteria to guide policy optimization. However, these criteria guarantee only that the system's state converges to an equilibrium point as time goes to infinity, which leads to a sub-optimal policy. In this paper, we propose a policy optimization technique that incorporates sampling-based Lyapunov stability. Our approach drives the system's state to an equilibrium point within an optimal time and keeps it stable thereafter, a property we refer to as "optimal-time stability". To achieve this, we integrate the optimization method into the Actor-Critic framework, yielding the Adaptive Lyapunov-based Actor-Critic (ALAC) algorithm. In evaluations on ten robotic tasks, our approach significantly outperforms previous studies and effectively guides the system to generate stable patterns.
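To make the distinction between asymptotic and finite-time behaviour concrete, the following is a minimal sketch of how a sampling-based Lyapunov decrease condition and a settling time could be estimated from sampled data. It is not the ALAC objective: the function names, the decrease rate `alpha`, the tolerance `tol`, and the use of a generic Lyapunov candidate value per state are all assumptions made for illustration.

```python
def lyapunov_decrease_violation(l_now, l_next, alpha=0.1):
    """Average of L(s') - L(s) + alpha * L(s) over sampled transitions.

    l_now and l_next hold Lyapunov candidate values at the current and
    next states (hypothetical inputs). A result <= 0 means the sampled
    decrease condition holds on average; by itself this only suggests
    convergence in the limit, not convergence by a given time.
    """
    terms = [ln - lc + alpha * lc for lc, ln in zip(l_now, l_next)]
    return sum(terms) / len(terms)


def settling_step(costs, tol=1e-2):
    """First step from which every cost in the trajectory stays below tol.

    Returns None if the trajectory never settles within its horizon.
    """
    last_above = -1
    for t, c in enumerate(costs):
        if c >= tol:
            last_above = t
    step = last_above + 1
    return step if step < len(costs) else None


if __name__ == "__main__":
    # Toy values chosen by hand, not experimental data.
    print(lyapunov_decrease_violation([1.0, 0.5], [0.8, 0.4]))
    print(settling_step([1.0, 0.5, 0.005, 0.001]))
```

In this sketch, a policy can satisfy the decrease condition while still taking arbitrarily long to settle; a measure such as `settling_step` is one simple way to quantify the finite-time behaviour that "optimal-time stability" targets.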