Safety assurance in Reinforcement Learning (RL) is critical for exploration in real-world scenarios. When solving Constrained Markov Decision Processes, current approaches have intrinsic difficulty trading off optimality against feasibility. Direct optimization methods cannot strictly guarantee state-wise safety during training, while projection-based methods are usually inefficient because they correct actions through lengthy iterations. To address these two challenges, this paper proposes an adaptive surrogate chance constraint for the safety cost, together with a hierarchical architecture that corrects actions produced by the upper policy layer via a fast quasi-Newton method. Theoretical analysis indicates that the relaxed probabilistic constraint is sufficient to guarantee forward invariance of the safe set. We validate the proposed method on four simulated and real-world safety-critical robotic tasks. Results indicate that the proposed method efficiently enforces safety (nearly zero violations) while preserving optimality (+23.8%), robustness, and generalizability to stochastic real-world settings.
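To make the idea of a lower layer that corrects upper-layer actions concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a Gaussian cost prediction, so the chance constraint reduces to `mean + KAPPA * SIGMA <= LIMIT`, and it swaps in a secant iteration (the one-dimensional quasi-Newton method) on the step size along the constraint gradient, since the abstract does not specify the solver. All names (`correct_action`, `g`, `grad_g`, `KAPPA`, `SIGMA`, `LIMIT`) and the quadratic cost model are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def correct_action(
    proposed: Sequence[float],
    g: Callable[[Sequence[float]], float],
    grad_g: Callable[[Sequence[float]], Vector],
    tol: float = 1e-10,
    max_iter: int = 50,
) -> Vector:
    """Return `proposed` unchanged if g(proposed) <= 0; otherwise move it
    along -grad g(proposed) until g is numerically zero, using a secant
    iteration on the scalar step size."""
    a0 = list(proposed)
    h0 = g(a0)
    if h0 <= 0.0:
        return a0  # upper-layer action already satisfies the constraint
    d = grad_g(a0)
    d_sq = sum(x * x for x in d)
    if d_sq == 0.0:
        return a0  # no usable direction; a real system would need a fallback

    def along(lam: float) -> Vector:
        return [x - lam * dx for x, dx in zip(a0, d)]

    lam_prev, h_prev = 0.0, h0
    lam = h0 / d_sq  # first step is a Newton step: dh/dlam at 0 is -|grad g|^2
    for _ in range(max_iter):
        h = g(along(lam))
        if abs(h) <= tol or h == h_prev:
            break  # converged, or the secant slope is undefined
        # Secant update: the slope is estimated from the last two evaluations.
        lam, lam_prev, h_prev = lam - h * (lam - lam_prev) / (h - h_prev), lam, h
    return along(lam)


# Hypothetical surrogate chance constraint under a Gaussian cost model.
KAPPA = 1.645  # one-sided 95% Gaussian quantile
SIGMA = 0.1    # assumed standard deviation of the predicted cost
LIMIT = 1.0    # assumed cost threshold


def g(a: Sequence[float]) -> float:
    mean_cost = sum(x * x for x in a)  # toy cost: squared action norm
    return mean_cost + KAPPA * SIGMA - LIMIT


def grad_g(a: Sequence[float]) -> Vector:
    return [2.0 * x for x in a]


corrected = correct_action([1.2, 0.9], g, grad_g)
unchanged = correct_action([0.1, 0.2], g, grad_g)
```

In this toy case the constraint gradient points radially, so the corrected action is the exact projection onto the constraint boundary. For a general constraint, moving along a fixed gradient direction is only a heuristic, and a multivariate quasi-Newton solver such as BFGS would be the natural replacement.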