Reinforcement learning (RL) has become the de facto method for achieving locomotion on humanoid robots in practice, yet stability analysis of the resulting control policies is lacking. Recent work has attempted to merge control-theoretic ideas with reinforcement learning through control-guided learning. A notable example is the use of a control Lyapunov function (CLF) to synthesize the reinforcement learning rewards, a technique known as CLF-RL, which has shown practical success. This paper investigates the stability properties of optimal controllers obtained with CLF-RL, with the goal of bridging the gap between experimentally observed stability and theoretical guarantees. The RL problem is viewed as an optimal control problem, and exponential stability is proven in both continuous and discrete time, using both the core CLF reward terms and the additional terms used in practice. The theoretical bounds are numerically verified on systems such as the double integrator and the cart-pole. Finally, the CLF-guided rewards are implemented on a walking humanoid robot to generate stable periodic orbits.
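To make the idea of a CLF-guided reward concrete, the following is a minimal sketch on a discrete-time double integrator. It is not the paper's formulation: the abstract does not give the reward terms, so the reward shape (a penalty on V plus a penalty on violating a geometric decay condition), the hand-picked gain `K`, the weight, and all function names are assumptions made purely for illustration. The quadratic CLF V(x) = xᵀPx comes from a discrete Lyapunov equation for the closed loop, and the rollout checks numerically that V contracts by a fixed factor at each step.

```python
import numpy as np


def solve_discrete_lyapunov(a_cl, q):
    """Solve a_cl.T @ P @ a_cl - P = -q for P via Kronecker vectorization."""
    n = a_cl.shape[0]
    lhs = np.eye(n * n) - np.kron(a_cl.T, a_cl.T)
    p = np.linalg.solve(lhs, q.reshape(-1)).reshape(n, n)
    return 0.5 * (p + p.T)  # symmetrize against round-off


# Discrete-time double integrator (position, velocity) with time step DT.
DT = 0.05
A = np.array([[1.0, DT], [0.0, 1.0]])
B = np.array([[0.5 * DT**2], [DT]])

# Assumption: a hand-picked stabilizing gain standing in for a learned policy.
K = np.array([[1.0, 1.7]])
Q = np.eye(2)

A_CL = A - B @ K
P = solve_discrete_lyapunov(A_CL, Q)

# Along u = -K x, V(x_next) - V(x) = -x.T Q x <= -RATE * V(x).
RATE = np.linalg.eigvalsh(Q).min() / np.linalg.eigvalsh(P).max()


def clf(x):
    """Quadratic CLF candidate V(x) = x.T P x."""
    return float(x @ P @ x)


def clf_reward(x, x_next, weight=10.0):
    """Illustrative reward: penalize V and any violation of the decay condition."""
    v, v_next = clf(x), clf(x_next)
    violation = max(0.0, v_next - (1.0 - RATE) * v)
    return -v - weight * violation


def rollout(x0, steps=200):
    """Simulate the closed loop and record V and the reward at each step."""
    x = np.asarray(x0, dtype=float)
    values, rewards = [clf(x)], []
    for _ in range(steps):
        u = -K @ x
        x_next = A @ x + B @ u
        rewards.append(clf_reward(x, x_next))
        values.append(clf(x_next))
        x = x_next
    return np.array(values), np.array(rewards)


if __name__ == "__main__":
    values, rewards = rollout([1.0, 0.0])
    decays = np.all(values[1:] <= (1.0 - RATE) * values[:-1] + 1e-9)
    print("per-step contraction factor:", 1.0 - RATE)
    print("decay condition holds along rollout:", bool(decays))
```

Because the gain here already satisfies the decay condition, the violation term stays at zero and the reward reduces to -V. In an RL setting, the same reward would instead be evaluated on transitions produced by the policy being trained.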