Recently, safe reinforcement learning (RL) with the actor-critic structure for continuous control tasks has received increasing attention. However, learning a near-optimal control policy with safety and convergence guarantees remains challenging, and few works have addressed safe RL algorithm design under time-varying safety constraints. This paper proposes a safe RL algorithm for optimal control of nonlinear systems with time-varying state and control constraints. The approach constructs a novel barrier force-based control policy structure to guarantee control safety. A multi-step policy evaluation mechanism predicts the policy's safety risk under time-varying safety constraints and guides the policy to update safely. We prove theoretical results on stability and robustness and analyze the convergence of the actor-critic implementation. The proposed algorithm outperforms several state-of-the-art RL algorithms in the simulated Safety Gym environment. Furthermore, the approach is applied to the integrated path-following and collision-avoidance problem for two real-world intelligent vehicles: a differential-drive vehicle is used to verify offline deployment, and an Ackermann-drive vehicle is used to verify online learning performance. In the experiments, the approach shows strong sim-to-real transfer capability and satisfactory online control performance.