Concurrent Learning of Policy and Unknown Safety Constraints in Reinforcement Learning

Reinforcement learning (RL) has transformed decision-making across a wide range of domains over the past few decades. Deploying RL policies in the real world, however, requires ensuring safety. Traditional safe RL approaches have predominantly focused on incorporating predefined safety constraints into policy learning. This reliance is limiting in dynamic and unpredictable real-world settings, where such constraints may be unavailable or insufficiently adaptable. To bridge this gap, we propose a novel approach that concurrently learns a safe RL control policy and identifies the unknown safety constraint parameters of a given environment. Starting from a parametric signal temporal logic (pSTL) safety specification and a small initial labeled dataset, we frame the problem as a bilevel optimization task that combines constrained policy optimization, using a Lagrangian variant of the twin delayed deep deterministic policy gradient (TD3) algorithm, with Bayesian optimization over the parameters of the given pSTL safety specification. In comprehensive case studies, we validate the approach across varying forms of environmental constraints, where it consistently yields safe RL policies with high returns. Furthermore, the learned STL safety constraint parameters conform closely to the true environmental safety constraints. The performance of our model closely mirrors that of an ideal scenario with complete prior knowledge of the safety constraints, demonstrating that it can accurately identify environmental safety constraints and learn safe policies that adhere to them.
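To make the bilevel structure concrete, the following is a minimal sketch under stated assumptions, not the implementation evaluated in this work. It assumes a single hypothetical pSTL template, "always x_t < c", with one unknown parameter c. It replaces Bayesian optimization with a plain grid search over candidate values of c to stay dependency-free. It also replaces the TD3 policy update with only the Lagrange multiplier step that a Lagrangian method would pair with it. All function names, the toy dataset, the cost limit, and the learning rate are illustrative.

```python
from typing import List, Sequence, Tuple

LabeledTrajectory = Tuple[Sequence[float], bool]  # (signal values, is_safe label)


def robustness_always_below(traj: Sequence[float], c: float) -> float:
    """Quantitative robustness of the template G (x_t < c).

    Positive when every sample stays below c, negative otherwise.
    """
    return min(c - x for x in traj)


def label_agreement(c: float, data: List[LabeledTrajectory]) -> float:
    """Fraction of labeled trajectories whose label matches the sign of the robustness."""
    correct = 0
    for traj, is_safe in data:
        predicted_safe = robustness_always_below(traj, c) > 0
        correct += int(predicted_safe == is_safe)
    return correct / len(data)


def fit_threshold(data: List[LabeledTrajectory], candidates: Sequence[float]) -> float:
    """Parameter-learning level: pick the candidate c with the best label agreement.

    Assumption: grid search stands in for Bayesian optimization here.
    """
    return max(candidates, key=lambda c: label_agreement(c, data))


def dual_ascent_step(lmbda: float, avg_cost: float, cost_limit: float, lr: float = 0.1) -> float:
    """Policy-learning level (multiplier only): projected gradient ascent on lambda.

    The multiplier grows while the average constraint cost exceeds the limit
    and shrinks toward zero once the constraint is satisfied.
    """
    return max(0.0, lmbda + lr * (avg_cost - cost_limit))


if __name__ == "__main__":
    # Toy labeled dataset (hypothetical): trajectories exceeding 1.0 are unsafe.
    data = [
        ([0.1, 0.5, 0.9], True),
        ([0.2, 1.4], False),
        ([0.3, 1.1], False),
        ([0.0, 0.95], True),
    ]
    c_hat = fit_threshold(data, candidates=[0.5, 1.0, 1.5, 2.0])
    print("learned threshold:", c_hat)
    print("next multiplier:", dual_ascent_step(0.0, avg_cost=0.5, cost_limit=0.2))
```

In a full loop, the two levels would alternate: the policy is trained against the current parameter estimate, newly labeled rollouts are added to the dataset, and the parameter is re-estimated.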
