Model-based reinforcement learning seeks to simultaneously learn the dynamics of an unknown stochastic environment and synthesise an optimal policy for acting in it. Ensuring the safety and robustness of the sequential decisions a policy makes in such an environment is a key challenge for policies intended for safety-critical scenarios. In this work, we investigate two complementary problems: first, computing reach-avoid probabilities for iterative predictions made with dynamical models whose dynamics are described by a Bayesian neural network (BNN); second, synthesising control policies that are optimal with respect to a given reach-avoid specification (reaching a "target" state while avoiding a set of "unsafe" states) and a learned BNN model. Our solution leverages interval propagation and backward recursion techniques to compute lower bounds on the probability that a policy's sequence of actions satisfies the reach-avoid specification. These lower bounds provide safety certification for the given policy and BNN model. We then introduce control synthesis algorithms to derive policies that maximise these lower bounds on the safety probability. We demonstrate the effectiveness of our method on a series of control benchmarks characterised by learned BNN dynamics models. On our most challenging benchmark, compared to purely data-driven policies, the optimal synthesis algorithm provides more than a four-fold increase in the number of certifiable states and more than a three-fold increase in the average guaranteed reach-avoid probability.
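To make the backward-recursion idea concrete, the following is a minimal sketch rather than the algorithm of this work. It assumes the state space has already been partitioned into a finite number of cells, and that some earlier step (for example, interval propagation through the BNN) has produced a lower bound `p_low[i][j]` on the probability of moving from cell `i` to cell `j` under a fixed policy. The function name `reach_avoid_lower_bound`, the cell indexing, and the example numbers are all hypothetical and chosen only for illustration.

```python
from typing import List, Set


def reach_avoid_lower_bound(
    p_low: List[List[float]],
    goal: Set[int],
    unsafe: Set[int],
    horizon: int,
) -> List[float]:
    """Lower-bound the probability of reaching `goal` within `horizon`
    steps without entering `unsafe`, for every cell of a finite partition.

    Assumptions (not taken from the source): `p_low[i][j]` is a valid
    lower bound on the one-step transition probability from cell i to
    cell j under a fixed policy, and `goal` and `unsafe` are disjoint.
    """
    n = len(p_low)
    # With zero steps left, only cells already in the goal satisfy the spec.
    value = [1.0 if i in goal else 0.0 for i in range(n)]
    for _ in range(horizon):
        new_value = []
        for i in range(n):
            if i in goal:
                new_value.append(1.0)  # specification already satisfied
            elif i in unsafe:
                new_value.append(0.0)  # specification already violated
            else:
                # Values are non-negative, so replacing the true transition
                # probabilities by lower bounds can only decrease the sum.
                new_value.append(
                    sum(p_low[i][j] * value[j] for j in range(n))
                )
        value = new_value
    return value


# Hypothetical three-cell example: cell 0 is ordinary, 1 is the goal,
# 2 is unsafe. The rows need not sum to 1 because they are lower bounds.
p_low_example = [
    [0.5, 0.3, 0.1],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
bounds = reach_avoid_lower_bound(p_low_example, goal={1}, unsafe={2}, horizon=2)
```

In this sketch a cell would count as certified when its entry in `bounds` exceeds a chosen threshold. Policy synthesis could then be framed as picking, for each cell and time step, the action whose transition bounds maximise the recursion's right-hand side, though how that maximisation is carried out here is not specified by the abstract.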