Accurate risk quantification and reachability analysis are crucial for safe control and learning, but sampling from rare events, risky states, or long-term trajectories can be prohibitively costly. Motivated by this, we study how to estimate the long-term safety probability of maximally safe actions without sufficient sample coverage of risky states and long-term trajectories. Using the maximal safety probability in control and learning is expected to avoid conservative behaviors caused by over-approximation of risk. Here, we first show that the long-term safety probability, which is multiplicative in time, can be converted into additive costs and solved using standard reinforcement learning methods. We then derive this probability as the solution of partial differential equations (PDEs) and propose a Physics-Informed Reinforcement Learning (PIRL) algorithm. The proposed method can learn from sparse rewards because the physics constraints help propagate risk information to neighboring states. This suggests that physics constraints can serve as an alternative to reward shaping for extracting more information for efficient learning. The proposed method can also estimate long-term risk from short-term samples and deduce the risk of unsampled states. This feature is in stark contrast with unconstrained deep RL, which demands sufficient data coverage. These merits of the proposed method are demonstrated in numerical simulations.
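As a brief illustration of the multiplicative-to-additive conversion (the abstract does not spell out the construction, so the logarithmic identity below is one standard route rather than necessarily the paper's; the safe set $\mathcal{C}$, horizon $T$, and per-step conditional survival probabilities $p_k$ are notation introduced for this sketch):

$$
\Pr\!\left[\, x_k \in \mathcal{C} \ \text{for all}\ k = 0,\dots,T-1 \,\right]
= \prod_{k=0}^{T-1} p_k,
\qquad
\log \prod_{k=0}^{T-1} p_k \;=\; \sum_{k=0}^{T-1} \log p_k,
$$

with $p_k = \Pr\!\left[ x_k \in \mathcal{C} \mid x_0,\dots,x_{k-1} \in \mathcal{C} \right]$ by the chain rule of probability. Maximizing the safety probability then amounts to maximizing an additive sum of per-step terms, which is the objective form handled by standard reinforcement learning.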
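The PDE-constrained learning step can be sketched in the style of physics-informed training. The following is a minimal sketch, not the paper's implementation: it assumes diffusion dynamics $dx = f\,dt + \sigma\,dW$ with diagonal $\sigma$, a backward-Kolmogorov-type PDE $\partial_t \Phi + f \cdot \nabla_x \Phi + \tfrac{1}{2}\sum_i \sigma_i^2 \,\partial_{x_i}^2 \Phi = 0$ for the safety probability $\Phi(t,x)$, and PyTorch; the names `SafetyNet`, `pde_residual`, `pirl_loss`, and the weight `lam` are all hypothetical.

```python
import torch
import torch.nn as nn

class SafetyNet(nn.Module):
    """Hypothetical network: Phi(t, x) approximates the long-term safety probability."""
    def __init__(self, x_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + 1, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # output is a probability in [0, 1]
        )

    def forward(self, t, x):
        return self.net(torch.cat([t, x], dim=-1))

def pde_residual(model, t, x, f, sigma):
    """Residual of the assumed backward-Kolmogorov-type PDE:
    dPhi/dt + f . grad_x Phi + 0.5 * sum_i sigma_i^2 * d2Phi/dx_i^2 = 0
    (diagonal diffusion assumed; the paper's exact PDE may differ)."""
    t = t.detach().requires_grad_(True)
    x = x.detach().requires_grad_(True)
    phi = model(t, x)
    phi_t, phi_x = torch.autograd.grad(phi.sum(), (t, x), create_graph=True)
    trace = torch.zeros_like(phi)
    for i in range(x.shape[-1]):
        # second derivative in coordinate i, via autograd of the first derivative
        phi_xixi = torch.autograd.grad(
            phi_x[..., i].sum(), x, create_graph=True
        )[0][..., i:i + 1]
        trace = trace + sigma[..., i:i + 1] ** 2 * phi_xixi
    drift = (f * phi_x).sum(dim=-1, keepdim=True)
    return phi_t + drift + 0.5 * trace

def pirl_loss(model, t, x, y, f, sigma, lam=1.0):
    """Data term (fit short-horizon safety labels y in [0, 1]) plus a
    physics term penalizing the PDE residual at the same points."""
    data = nn.functional.mse_loss(model(t, x), y)
    phys = pde_residual(model, t, x, f, sigma).pow(2).mean()
    return data + lam * phys
```

A training step would sample $(t, x)$ pairs with short-horizon safety labels $y$ (e.g., whether a short rollout stayed in the safe set) and descend `pirl_loss`; the physics term can additionally be evaluated at collocation points that come from neither long nor risky trajectories, which is one way risk information could propagate to unsampled states as described above.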