Safety is an indispensable requirement for applying reinforcement learning (RL) to real problems. Although many safe RL algorithms have been proposed in recent years, most existing work 1) relies on numeric safety feedback; 2) does not guarantee safety during the learning process; 3) restricts the problem to a priori known, deterministic transition dynamics; and/or 4) assumes that a known safe policy exists for every state. To address these issues, we propose Long-term Binary-feedback Safe RL (LoBiSaRL), a safe RL algorithm for constrained Markov decision processes (CMDPs) with binary safety feedback and an unknown, stochastic state transition function. LoBiSaRL optimizes a policy to maximize rewards while guaranteeing long-term safety: with high probability, the agent executes only safe state-action pairs throughout each episode. Specifically, LoBiSaRL models the binary safety function via a generalized linear model (GLM) and conservatively takes only safe actions at every time step while inferring their effect on future safety under proper assumptions. Our theoretical results show that LoBiSaRL satisfies the long-term safety constraint with high probability. Finally, our empirical results demonstrate that our algorithm is safer than existing methods without significantly compromising performance in terms of reward.
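To make the idea of modeling binary safety feedback with a GLM and acting conservatively more concrete, the following is a minimal sketch and not the LoBiSaRL algorithm itself. It assumes a logistic link, a hypothetical feature vector `phi` per state-action pair, and a confidence width of the form `beta * ||phi||` in the norm induced by an inverse design matrix; `beta`, `threshold`, and all function names are illustrative assumptions. It checks only the current step and omits the inference about future safety described in the abstract.

```python
import numpy as np


def sigmoid(z):
    """Logistic link function mapping a linear score to P(safe)."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic_glm(features, labels, reg=1.0, iters=50):
    """L2-regularized maximum-likelihood estimate via Newton's method.

    features: (n, d) array of state-action features (hypothetical).
    labels:   (n,) array of binary safety feedback (1 = safe, 0 = unsafe).
    """
    n, d = features.shape
    theta = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(features @ theta)
        grad = features.T @ (p - labels) + reg * theta
        weights = p * (1.0 - p)
        hess = (features * weights[:, None]).T @ features + reg * np.eye(d)
        theta = theta - np.linalg.solve(hess, grad)
    return theta


def pessimistic_safety(phi, theta, design_inv, beta):
    """Lower confidence bound on P(safe) for one feature vector.

    The width beta * sqrt(phi^T V^{-1} phi) is an assumed form, not
    taken from the source.
    """
    width = beta * np.sqrt(phi @ design_inv @ phi)
    return sigmoid(phi @ theta - width)


def select_safe_actions(candidates, theta, design_inv, beta, threshold):
    """Return indices of candidate actions whose pessimistic safety
    estimate meets the threshold (conservative one-step filter)."""
    return [
        i
        for i, phi in enumerate(candidates)
        if pessimistic_safety(phi, theta, design_inv, beta) >= threshold
    ]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_theta = np.array([2.0, -1.0])
    X = rng.normal(size=(200, 2))
    y = (rng.random(200) < sigmoid(X @ true_theta)).astype(float)

    theta_hat = fit_logistic_glm(X, y)
    design_inv = np.linalg.inv(X.T @ X + np.eye(2))
    candidates = np.array([[1.0, 0.0], [0.0, 1.0]])
    print(select_safe_actions(candidates, theta_hat, design_inv, 1.0, 0.7))
```

A larger `beta` widens the confidence interval and rejects more actions, which trades reward for safety; how such a parameter should be set to obtain a high-probability guarantee is the subject of the paper's analysis and is not reproduced here.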