Establishing stability certificates for closed-loop systems under reinforcement learning (RL) policies is essential for moving beyond empirical performance to guarantees on system behavior. Classical Lyapunov methods require a strict stepwise decrease of the Lyapunov function, but such certificates are difficult to construct for learned policies. The RL value function is a natural candidate, yet it is not well understood how it can be adapted for this purpose. To gain intuition, we first study the linear quadratic regulator (LQR) problem and make two key observations. First, a Lyapunov function can be obtained from the value function of an LQR policy by augmenting it with a residual term related to the system dynamics and stage cost. Second, the classical Lyapunov decrease requirement can be relaxed to a generalized Lyapunov condition that requires a decrease only on average over multiple time steps. Building on this intuition, we turn to the nonlinear setting and formulate an approach that learns generalized Lyapunov functions by augmenting RL value functions with neural-network residual terms. Our approach successfully certifies the stability of RL policies trained on Gymnasium and DeepMind Control benchmarks. We further extend the method to jointly train neural controllers and stability certificates using a multi-step Lyapunov loss, yielding larger certified inner approximations of the region of attraction than the classical Lyapunov approach. Overall, our formulation makes stability certificates easier to construct and thereby enables stability certification for a broad class of systems with learned policies, bridging classical control theory and modern learning-based methods.
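To make the two observations concrete, one illustrative form of the candidate and of the generalized decrease condition is sketched below; the symbols $W$, $r_{\theta}$, the horizon $k$, and the margin $\alpha(\cdot)$ are placeholder notation for this sketch, not the exact definitions used in the paper:

\[
W(x) = V^{\pi}(x) + r_{\theta}(x), \qquad
\frac{1}{k}\sum_{i=1}^{k} W(x_{t+i}) \;\le\; W(x_t) - \alpha\bigl(\lVert x_t \rVert\bigr),
\]

where $V^{\pi}$ is the RL value function of the policy $\pi$, $r_{\theta}$ is a residual term (quadratic in the LQR case, a neural network in the nonlinear case), and $x_{t+1}, x_{t+2}, \dots$ are states of the closed-loop system under $\pi$. The classical stepwise Lyapunov condition is recovered at $k = 1$.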
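As a rough illustration of how such a certificate could be trained, the sketch below assumes PyTorch; the names LyapunovCandidate and multistep_lyapunov_loss, the frozen pretrained critic, and the specific quadratic margin term are hypothetical choices for this sketch rather than the exact loss used in the paper.

```python
import torch
import torch.nn as nn

class LyapunovCandidate(nn.Module):
    """Candidate W(x) = V(x) + r_theta(x): a frozen RL value function plus a learned residual."""
    def __init__(self, value_fn, state_dim, hidden=64):
        super().__init__()
        self.value_fn = value_fn  # pretrained critic, kept fixed during certificate learning
        self.residual = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        with torch.no_grad():
            v = self.value_fn(x)  # value term is not updated
        return v + self.residual(x)

def multistep_lyapunov_loss(W, trajectories, horizon=5, margin=1e-3):
    """Penalize violations of an average decrease of W over `horizon` steps.

    trajectories: tensor of shape (batch, T, state_dim), T >= horizon + 1,
    rolled out under the fixed policy. The enforced condition is the illustrative
    mean_i W(x_{t+i}) <= W(x_t) - margin * ||x_t||^2; positivity of W would also
    need to be enforced in practice and is omitted here for brevity.
    """
    x0 = trajectories[:, 0, :]
    w0 = W(x0).squeeze(-1)                       # W at the initial states, shape (batch,)
    future = trajectories[:, 1:horizon + 1, :]   # next `horizon` states
    b, k, d = future.shape
    w_future = W(future.reshape(b * k, d)).reshape(b, k)
    avg_decrease = w_future.mean(dim=1) - w0     # should be sufficiently negative
    violation = torch.relu(avg_decrease + margin * (x0 ** 2).sum(dim=1))
    return violation.mean()
```

A training loop would sample closed-loop trajectories from the environment, minimize this loss over the residual parameters, and then verify the learned candidate on the region of interest.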