The deployment of artificial intelligence (AI) in decision-making applications requires ensuring an appropriate level of safety and reliability, particularly in changing environments that contain a large number of unknown observations. To address this challenge, we propose a novel safe reinforcement learning (RL) approach that uses anomalous state sequences to enhance RL safety. Our proposed solution, Safe Reinforcement Learning with Anomalous State Sequences (AnoSeqs), consists of two stages. First, we train an agent in a non-safety-critical, offline 'source' environment to collect safe state sequences. Next, we use these safe sequences to build an anomaly detection model that can detect potentially unsafe state sequences in a 'target' safety-critical environment where failures can have high costs. The risk estimated by the anomaly detection model is then used to train a risk-averse RL policy in the target environment; this involves adjusting the reward function to penalize the agent for visiting anomalous states deemed unsafe by our anomaly model. In experiments on multiple safety-critical benchmark environments, including self-driving cars, our approach successfully learns safer policies and demonstrates that sequential anomaly detection can provide an effective supervisory signal for training safety-aware RL agents.
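The core mechanism described above — fitting an anomaly model on safe state sequences from a source environment, then penalizing the target-environment reward when a visited sequence scores as anomalous — can be sketched as follows. This is a minimal illustration under stated assumptions: the Gaussian z-score detector, window representation, threshold, and penalty weight are all hypothetical stand-ins, not the paper's actual anomaly model or hyperparameters.

```python
import numpy as np

class SequenceAnomalyScorer:
    """Toy anomaly scorer: distance of a flattened state window
    from the distribution of safe windows (an assumption for
    illustration; the actual AnoSeqs detector may differ)."""

    def fit(self, safe_windows: np.ndarray) -> None:
        # safe_windows: shape (num_windows, window_len * state_dim),
        # collected by an agent in the non-safety-critical source env.
        self.mean = safe_windows.mean(axis=0)
        self.std = safe_windows.std(axis=0) + 1e-8

    def score(self, window: np.ndarray) -> float:
        # Higher score = further from the safe-data distribution.
        z = (window - self.mean) / self.std
        return float(np.sqrt(np.mean(z ** 2)))


def shaped_reward(env_reward: float, anomaly_score: float,
                  threshold: float = 3.0, penalty: float = 10.0) -> float:
    # Reward adjustment: penalize the agent for visiting state
    # sequences the anomaly model deems unsafe. Threshold and
    # penalty values here are illustrative, not from the paper.
    if anomaly_score > threshold:
        return env_reward - penalty
    return env_reward


# Usage: fit on safe source-env windows, score target-env windows.
rng = np.random.default_rng(0)
safe = rng.normal(0.0, 1.0, size=(500, 8))   # simulated safe windows
scorer = SequenceAnomalyScorer()
scorer.fit(safe)

normal_window = rng.normal(0.0, 1.0, size=8)
unsafe_window = np.full(8, 10.0)             # far from the safe data
assert scorer.score(unsafe_window) > scorer.score(normal_window)
```

The shaped reward would then be fed to any standard RL algorithm in the target environment, making the learned policy risk-averse with respect to the anomaly signal.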