Reinforcement Learning (RL) algorithms have shown tremendous success in simulation environments, but their application to real-world problems faces significant challenges, with safety being a major concern. In particular, enforcing state-wise constraints is essential for many challenging tasks such as autonomous driving and robot manipulation. However, existing safe RL algorithms formulated under the Constrained Markov Decision Process (CMDP) framework do not consider state-wise constraints. To address this gap, we propose State-wise Constrained Policy Optimization (SCPO), the first general-purpose policy search algorithm for state-wise constrained reinforcement learning. SCPO provides guarantees for state-wise constraint satisfaction in expectation. Specifically, we introduce the Maximum Markov Decision Process framework and prove that the worst-case safety violation is bounded under SCPO. We demonstrate the effectiveness of our approach by training neural network policies on an extensive set of robot locomotion tasks, where the agent must satisfy a variety of state-wise safety constraints. Our results show that SCPO significantly outperforms existing methods and can handle state-wise constraints in high-dimensional robotics tasks.
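To make the distinction between a cumulative (CMDP-style) constraint and a state-wise constraint concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the SCPO algorithm itself: the function names, the example cost sequence, the budget, and the per-step limit are all hypothetical. The abstract does not define the Maximum Markov Decision Process, so the running-maximum bookkeeping in `max_increments` is only one plausible reading of how a worst-case per-step cost could be tracked, and it assumes nonnegative costs.

```python
from typing import List, Sequence


def satisfies_cumulative(costs: Sequence[float], budget: float) -> bool:
    """CMDP-style check: the summed cost over the trajectory stays within a budget."""
    return sum(costs) <= budget


def satisfies_statewise(costs: Sequence[float], limit: float) -> bool:
    """State-wise check: the cost at every single step stays within the limit."""
    return all(c <= limit for c in costs)


def max_increments(costs: Sequence[float]) -> List[float]:
    """Assumed illustration: rewrite per-step costs as increments of a running maximum.

    For nonnegative costs the increments sum to the largest per-step cost, so a
    bound on the summed increments is a bound on the worst-case step.
    """
    running_max = 0.0
    increments = []
    for c in costs:
        inc = max(c - running_max, 0.0)  # only count cost above the maximum seen so far
        increments.append(inc)
        running_max += inc
    return increments


# Hypothetical trajectory: mostly safe, with one large spike at the third step.
costs = [0.0, 0.0, 2.0, 0.0]
print(satisfies_cumulative(costs, budget=2.5))  # cumulative budget is respected
print(satisfies_statewise(costs, limit=1.0))    # the spike violates the per-step limit
print(sum(max_increments(costs)), max(costs))   # both equal the worst-case step cost
```

The example trajectory passes the cumulative check while violating the per-step limit, which is the kind of gap the abstract attributes to CMDP-based methods.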