In many deep reinforcement learning (RL) problems, the decisions of a trained policy vary in how strongly they affect the policy's expected safety and performance. Since RL policies are highly complex, testing efforts should concentrate on the states in which the agent's decisions have the highest impact on the expected outcome. In this paper, we propose a novel model-based method to rigorously compute a ranking of state importance across the entire state space and then focus testing efforts on the highest-ranked states. We concentrate on testing for safety; however, the proposed methods can easily be adapted to test for performance. In each iteration, our testing framework computes pessimistic and optimistic safety estimates, which provide lower and upper bounds on the expected outcome of executing the policy from every modeled state. Upon convergence, our approach partitions the state space into safe and unsafe regions, providing clear insight into the policy's weaknesses. Two important properties characterize our approach. (1) Optimal Test-Case Selection: at any point in the testing process, our approach evaluates the policy in the states that are currently most critical for safety. (2) Guaranteed Safety: our approach can provide formal verification guarantees over the entire state space while sampling only a fraction of the policy's decisions; any safety property established by the pessimistic estimate is formally proven to hold for the policy. We provide a detailed evaluation of our framework on several examples, showing that our method discovers unsafe policy behavior with low testing effort.
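The iterative bound computation described in the abstract can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's actual algorithm: the function names, the dictionary-based model format, and the gap-based selection rule are all hypothetical. The idea shown is that unsafe and confirmed-safe states pin their bounds, untested states start with the trivial interval [0, 1], fixed-point iteration propagates the bounds through the model, and the next test case is the state whose interval is widest (i.e., whose outcome is most uncertain).

```python
# Hypothetical sketch of interval-style safety estimation for a fixed
# policy on a small modeled MDP. The model format and selection rule
# are illustrative assumptions, not the paper's implementation.

def interval_safety_bounds(transitions, unsafe, tested_safe, n_iter=100):
    """Compute pessimistic (lower) and optimistic (upper) bounds on the
    probability of reaching a safe outcome from each state.

    transitions: state -> list of (next_state, prob) pairs induced by the
                 policy's action in that state (terminals self-loop)
    unsafe:      states known to violate the safety property (bounds = 0)
    tested_safe: states whose safety has been confirmed by testing (bounds = 1)
    All other states start with the trivial interval [0, 1]; iteration
    tightens the bounds by propagating them through the model.
    """
    states = list(transitions)
    lo = {s: 0.0 for s in states}
    hi = {s: 1.0 for s in states}
    for s in unsafe:
        lo[s] = hi[s] = 0.0
    for s in tested_safe:
        lo[s] = hi[s] = 1.0
    for _ in range(n_iter):
        for s in states:
            if s in unsafe or s in tested_safe:
                continue  # bounds are pinned by observed outcomes
            lo[s] = sum(p * lo[t] for t, p in transitions[s])
            hi[s] = sum(p * hi[t] for t, p in transitions[s])
    return lo, hi


def next_test_state(lo, hi):
    # Select the state with the widest bound gap: its outcome is the
    # most uncertain, so testing it is most informative.
    return max(lo, key=lambda s: hi[s] - lo[s])
```

A property assured by the lower bound `lo` holds regardless of how the untested states behave, which mirrors the abstract's guarantee that safety established by the pessimistic estimate is proven for the policy.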