Reinforcement Learning (RL) algorithms have shown tremendous success in simulation environments, but their application to real-world problems faces significant challenges, with safety being a major concern. In particular, enforcing state-wise constraints is essential for many challenging tasks such as autonomous driving and robot manipulation. However, existing safe RL algorithms under the Constrained Markov Decision Process (CMDP) framework do not consider state-wise constraints. To address this gap, we propose State-wise Constrained Policy Optimization (SCPO), the first general-purpose policy search algorithm for state-wise constrained reinforcement learning. SCPO provides guarantees for state-wise constraint satisfaction in expectation. Specifically, we introduce the framework of the Maximum Markov Decision Process and prove that the worst-case safety violation is bounded under SCPO. We demonstrate the effectiveness of our approach by training neural network policies on an extensive set of robot locomotion tasks, where the agent must satisfy a variety of state-wise safety constraints. Our results show that SCPO significantly outperforms existing methods and can handle state-wise constraints in high-dimensional robotics tasks.
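To make the gap between trajectory-level and state-wise constraints concrete, the following is a minimal sketch and is not taken from SCPO itself. It assumes a CMDP-style constraint bounds the discounted cumulative cost of a trajectory, while a state-wise constraint bounds the cost at every individual step. The function names, the cost values, and both limits are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def discounted_cumulative_cost(costs: Sequence[float], gamma: float = 0.99) -> float:
    """CMDP-style quantity: discounted sum of per-step costs along one trajectory."""
    return sum((gamma ** t) * c for t, c in enumerate(costs))


def worst_step_cost(costs: Sequence[float]) -> float:
    """State-wise quantity: the largest single-step cost along the trajectory."""
    return max(costs)


# Hypothetical per-step safety costs for one short trajectory.
costs = [0.0, 0.0, 0.9, 0.0, 0.0]
cumulative_budget = 1.0  # assumed trajectory-level limit
per_step_limit = 0.5     # assumed limit that must hold at every step

cumulative_ok = discounted_cumulative_cost(costs) <= cumulative_budget
statewise_ok = worst_step_cost(costs) <= per_step_limit
print(cumulative_ok, statewise_ok)  # prints: True False
```

In this toy trajectory a single large cost at one step stays within the cumulative budget yet exceeds the per-step limit, which is the kind of violation a purely cumulative constraint cannot rule out.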