Safe exploration aims to address the limitations of Reinforcement Learning (RL) in safety-critical scenarios, where failures during trial-and-error learning may incur high costs. Several methods exist to incorporate external knowledge or to use proximal sensor data to limit the exploration of unsafe states. However, reducing exploration risks in unknown environments, where an agent must discover safety threats during exploration, remains challenging. In this paper, we target the problem of safe exploration by guiding the training with counterexamples of the safety requirement. Our method abstracts both continuous and discrete state-space systems into compact abstract models representing the safety-relevant knowledge acquired by the agent during exploration. We then exploit probabilistic counterexample generation to construct minimal simulation submodels eliciting safety requirement violations, where the agent can efficiently train offline to refine its policy towards minimising the risk of safety violations during the subsequent online exploration. In preliminary experiments, our method reduces safety violations during online exploration by an average of 40.3% compared with the standard Q-learning (QL) and DQN algorithms and by 29.1% compared with previous related work, while achieving cumulative rewards comparable to those of unrestricted exploration and alternative approaches.
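To make the idea of a probabilistic counterexample concrete, the following is a minimal sketch and not the implementation described in this paper. It assumes the abstract model is a small discrete-time Markov chain given as a dictionary of transition probabilities, and that the safety requirement bounds the probability of reaching an unsafe state by a threshold. It enumerates the most probable paths to an unsafe state until their combined probability exceeds that bound, then collects the transitions on those paths as a simulation submodel. The names `strongest_evidence_paths`, `submodel_edges`, and the toy model are hypothetical, and best-first path enumeration is only one known way to build such counterexamples.

```python
import heapq
from itertools import count


def strongest_evidence_paths(transitions, start, unsafe, threshold,
                             max_expansions=10_000):
    """Enumerate paths from `start` to an unsafe state in decreasing
    probability until their total probability exceeds `threshold`.

    transitions: dict mapping state -> list of (next_state, probability).
    Returns (paths, mass), where paths is a list of (path_tuple, probability).
    Illustrative sketch only; an assumption, not the paper's algorithm.
    """
    tie = count()  # tie-breaker so the heap never compares path tuples
    # heapq is a min-heap, so store negated probabilities for best-first order.
    frontier = [(-1.0, next(tie), (start,))]
    paths, mass, expansions = [], 0.0, 0
    while frontier and mass <= threshold and expansions < max_expansions:
        neg_p, _, path = heapq.heappop(frontier)
        p, state = -neg_p, path[-1]
        if state in unsafe:
            # A path ends the first time it reaches an unsafe state.
            paths.append((path, p))
            mass += p
            continue
        expansions += 1
        for nxt, q in transitions.get(state, []):
            if q > 0.0:
                heapq.heappush(frontier, (-(p * q), next(tie), path + (nxt,)))
    return paths, mass


def submodel_edges(paths):
    """Collect the transitions that appear on the counterexample paths."""
    return {(a, b) for path, _ in paths for a, b in zip(path, path[1:])}


# Hypothetical toy abstraction: "goal" is absorbing, "unsafe" is the violation.
toy_model = {
    "s0": [("s1", 0.5), ("goal", 0.4), ("unsafe", 0.1)],
    "s1": [("unsafe", 0.4), ("s0", 0.6)],
}
toy_paths, toy_mass = strongest_evidence_paths(
    toy_model, "s0", {"unsafe"}, threshold=0.25)
toy_edges = submodel_edges(toy_paths)
```

In this sketch the resulting edge set plays the role of the small simulation submodel: an agent could be trained offline on just those transitions, which concentrate the behaviour that leads to violations. The `max_expansions` cap is a design choice to guarantee termination when the model contains cycles and the threshold cannot be exceeded.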