Safe and optimal controller synthesis for switched-controlled hybrid systems, which combine differential equations and discrete changes of the system's state, is known to be intricately hard. Reinforcement learning has been leveraged to construct near-optimal controllers, but their behavior is not guaranteed to be safe, even when it is encouraged by reward engineering. One way of imposing safety to a learned controller is to use a shield, which is correct by design. However, obtaining a shield for non-linear and hybrid environments is itself intractable. In this paper, we propose the construction of a shield using the so-called barbaric method, where an approximate finite representation of an underlying partition-based two-player safety game is extracted via systematically picked samples of the true transition function. While hard safety guarantees are out of reach, we experimentally demonstrate strong statistical safety guarantees with a prototype implementation and UPPAAL STRATEGO. Furthermore, we study the impact of the synthesized shield when applied as either a pre-shield (applied before learning a controller) or a post-shield (only applied after learning a controller). We experimentally demonstrate superiority of the pre-shielding approach. We apply our technique on a range of case studies, including two industrial examples, and further study post-optimization of the post-shielding approach.
翻译:针对结合微分方程与系统状态离散变化的切换控制混杂系统,实现安全且最优的控制器合成被公认为极其困难。强化学习虽可构建近优控制器,但其行为无法保证安全性——即使通过奖励工程加以激励也是如此。对学习到的控制器施加安全性的方法之一是使用防护机制,该机制通过设计保证正确性。然而,为非线性与混杂环境构建防护机制本身便难以处理。本文提出采用所谓"野蛮方法"构建防护机制:通过系统选取真实转移函数的样本,提取基于分区双人安全博弈的近似有限表示。尽管无法给出严格的安全保证,我们通过原型实现与UPPAAL STRATEGO进行的实验证明其具备强统计安全性保证。此外,我们研究了合成防护机制作为预防护(在控制器学习前应用)与后防护(仅在控制器学习后应用)两种模式的影响,实验表明预防护方法具有显著优越性。我们将该技术应用于包括两个工业案例在内的系列研究实例,并进一步探讨了后防护方法的优化策略。