In model-based reinforcement learning for safety-critical control systems, it is important to formally certify system properties (e.g., safety, stability) under the learned controller. However, as existing methods typically apply formal verification \emph{after} the controller has been learned, it is sometimes difficult to obtain any certificate, even after many iterations between learning and verification. To address this challenge, we propose a framework that jointly conducts reinforcement learning and formal verification by formulating and solving a novel bilevel optimization problem, which is differentiable by the gradients from the value function and certificates. Experiments on a variety of examples demonstrate the significant advantages of our framework over the model-based stochastic value gradient (SVG) method and the model-free proximal policy optimization (PPO) method in finding feasible controllers with barrier functions and Lyapunov functions that ensure system safety and stability.
翻译:在面向安全关键控制系统的基于模型的强化学习中,正式认证所学控制器下的系统属性(如安全性、稳定性)至关重要。然而,由于现有方法通常在控制器学习完成后才进行形式化验证,即便经过学习与验证间的多次迭代,有时仍难以获得任何认证。为应对这一挑战,本文提出一个联合进行强化学习与形式化验证的框架,通过构建并求解一个新型双层优化问题实现(该问题通过值函数与认证的梯度具有可微性)。在多个示例上的实验表明,相较于基于模型的随机值梯度(SVG)方法和无模型的近端策略优化(PPO)方法,本框架在寻找具有确保系统安全性与稳定性的屏障函数与李雅普诺夫函数的可行控制器方面具有显著优势。