Deep reinforcement learning (RL) excels in various control tasks, yet the absence of safety guarantees hampers its real-world applicability. In particular, exploration during learning usually results in safety violations, as the RL agent learns from those mistakes. On the other hand, safe control techniques ensure persistent safety satisfaction but demand strong priors on system dynamics, which are usually hard to obtain in practice. To address these problems, we present Safe Set Guided State-wise Constrained Policy Optimization (S-3PO), a pioneering algorithm that generates state-wise safe optimal policies with zero training violations, i.e., learning without mistakes. S-3PO first employs a safety-oriented monitor with black-box dynamics to ensure safe exploration. It then imposes a unique cost on the RL agent so that it converges to optimal behavior within the safety constraints. S-3PO outperforms existing methods in high-dimensional robotics tasks, managing state-wise constraints with zero training violations. This innovation marks a significant stride towards real-world safe RL deployment.
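The two-stage idea described above (a monitor that queries black-box dynamics to keep exploration safe, plus a cost signal returned to the learner) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from S-3PO itself: the 1-D system, the names `safety_index`, `black_box_step`, `monitor`, and `CANDIDATES`, the grid search over candidate actions, and the choice of the predicted violation as the cost.

```python
from typing import Callable, Sequence, Tuple


def safety_index(x: float) -> float:
    """Hypothetical safety index: non-positive exactly when |x| <= 1 (the safe set)."""
    return abs(x) - 1.0


def black_box_step(x: float, u: float, dt: float = 0.5) -> float:
    """Stand-in for a black-box simulator. The monitor only calls it and
    never inspects its internals (toy single-integrator, assumed here)."""
    return x + u * dt


def monitor(
    x: float,
    u_proposed: float,
    dynamics: Callable[[float, float], float],
    candidates: Sequence[float],
) -> Tuple[float, float]:
    """Return (action to execute, cost signal for the learner).

    If the proposed action keeps the predicted next state inside the safe
    set, it passes through with zero cost. Otherwise the closest safe
    candidate is executed, and the violation the proposed action *would*
    have caused is returned as the cost (an illustrative choice).
    """
    predicted = safety_index(dynamics(x, u_proposed))
    if predicted <= 0.0:
        return u_proposed, 0.0
    # Query the black box for each candidate and keep the safe ones.
    safe = [u for u in candidates if safety_index(dynamics(x, u)) <= 0.0]
    if not safe:
        raise RuntimeError("no safe candidate action found")
    u_safe = min(safe, key=lambda u: abs(u - u_proposed))
    return u_safe, predicted


# Hypothetical discretized action set searched by the monitor.
CANDIDATES = [-2.0, -1.0, 0.0, 1.0, 2.0]
```

In this sketch the environment only ever receives the action returned by `monitor`, so no violation occurs during training, while the nonzero cost still tells the learner that its proposal was unsafe. For example, `monitor(0.5, 2.0, black_box_step, CANDIDATES)` replaces the unsafe proposal `2.0` with `1.0`.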