Despite the tremendous success of Reinforcement Learning (RL) algorithms in simulation environments, applying RL to real-world applications still faces many challenges. A major concern is safety, in other words, constraint satisfaction. State-wise constraints are among the most common constraints in real-world applications and among the most challenging in Safe RL. Enforcing state-wise constraints is essential to many challenging tasks, such as autonomous driving and robot manipulation. This paper provides a comprehensive review of existing approaches that address state-wise constraints in RL. Under the framework of the State-wise Constrained Markov Decision Process (SCMDP), we discuss the connections, differences, and trade-offs of existing approaches in terms of (i) safety guarantee and scalability, (ii) safety and reward performance, and (iii) safety after convergence and during training. We also summarize the limitations of current methods and discuss potential future directions.