Safe exploration aims to address the limitations of Reinforcement Learning (RL) in safety-critical scenarios, where failures during trial-and-error learning may incur high costs. Several methods incorporate external knowledge or use proximal sensor data to limit the exploration of unsafe states. However, reducing exploration risk in unknown environments, where an agent must discover safety threats while exploring, remains challenging. In this paper, we target the problem of safe exploration by guiding training with counterexamples of the safety requirement. Our method abstracts both continuous and discrete state-space systems into compact abstract models that represent the safety-relevant knowledge acquired by the agent during exploration. We then use probabilistic counterexample generation to construct minimal simulation submodels that elicit violations of the safety requirement. The agent can train efficiently offline on these submodels, refining its policy to minimise the risk of safety violations during subsequent online exploration. In preliminary experiments, our method reduces safety violations during online exploration by an average of 40.3% compared with standard Q-learning (QL) and DQN algorithms and by 29.1% compared with previous related work, while achieving cumulative rewards comparable to those of unrestricted exploration and alternative approaches.
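To make the idea of a probabilistic counterexample concrete, the following is a minimal sketch and not the implementation used in this work. It assumes the abstract model is a discrete-time Markov chain estimated from observed transitions between abstract states under the current policy, and it takes a counterexample to be the smallest set of most probable paths into an unsafe state whose total probability exceeds a safety threshold. The function names (`estimate_abstract_model`, `counterexample_paths`), the state labels, and the threshold are all hypothetical and chosen only for illustration.

```python
import heapq
from collections import defaultdict


def estimate_abstract_model(transitions):
    """Estimate transition probabilities from observed (state, next_state) pairs.

    Assumption: states are already abstracted to hashable, comparable labels.
    Returns a dict: state -> {next_state: probability}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for state, next_state in transitions:
        counts[state][next_state] += 1
    model = {}
    for state, successors in counts.items():
        total = sum(successors.values())
        model[state] = {t: n / total for t, n in successors.items()}
    return model


def counterexample_paths(model, start, unsafe, threshold, max_len=20):
    """Collect the most probable paths from `start` to an unsafe state.

    Paths are expanded best-first by probability. Extending a path can never
    raise its probability, so complete paths are found in non-increasing
    order of probability. The search stops once the accumulated probability
    exceeds `threshold`, or when no paths remain.
    """
    heap = [(-1.0, (start,))]  # entries are (negated probability, path)
    paths, mass = [], 0.0
    while heap and mass <= threshold:
        neg_prob, path = heapq.heappop(heap)
        state = path[-1]
        if state in unsafe:
            paths.append((path, -neg_prob))
            mass += -neg_prob
            continue
        if len(path) > max_len:
            continue  # bound the search on cyclic models
        for next_state, prob in model.get(state, {}).items():
            heapq.heappush(heap, (neg_prob * prob, path + (next_state,)))
    return paths, mass


# Hypothetical experience gathered during online exploration.
observed = [
    ("s0", "s1"), ("s0", "s1"), ("s0", "s1"), ("s0", "s2"),
    ("s1", "bad"), ("s1", "goal"),
    ("s2", "bad"),
]
model = estimate_abstract_model(observed)
paths, mass = counterexample_paths(model, "s0", {"bad"}, threshold=0.5)

# The states on the counterexample paths form the submodel for offline training.
submodel_states = {state for path, _ in paths for state in path}
```

In this sketch, the states appearing on the returned paths would define the simulation submodel on which the agent trains offline. How the abstraction is built and how minimality of the submodel is enforced are beyond what this illustration covers.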