Concurrent Learning of Policy and Unknown Safety Constraints in Reinforcement Learning

Reinforcement learning (RL) has revolutionized decision-making across a wide range of domains over the past few decades. Yet deploying RL policies in real-world scenarios raises the crucial challenge of ensuring safety. Traditional safe RL approaches have predominantly focused on incorporating predefined safety constraints into the policy learning process. This reliance on predefined constraints is limiting in dynamic and unpredictable real-world settings, where such constraints may be unavailable or insufficiently adaptable. To bridge this gap, we propose a novel approach that concurrently learns a safe RL control policy and identifies the unknown safety constraint parameters of a given environment. Starting from a parametric signal temporal logic (pSTL) safety specification and a small initial labeled dataset, we frame the problem as a bilevel optimization task that integrates constrained policy optimization, using a Lagrangian variant of the twin delayed deep deterministic policy gradient (TD3) algorithm, with Bayesian optimization of the parameters of the given pSTL safety specification. Through experiments in comprehensive case studies, we validate the efficacy of this approach across varying forms of environmental constraints, consistently yielding safe RL policies with high returns. Furthermore, our findings indicate that the STL safety constraint parameters are learned successfully and conform closely to the true environmental safety constraints. The performance of our model closely mirrors that of an ideal scenario with complete prior knowledge of the safety constraints, demonstrating that it can accurately identify environmental safety constraints and learn safe policies that adhere to them.
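To make the two ingredients of the bilevel formulation concrete, the following is a minimal sketch under stated assumptions, not the implementation described in the paper. It assumes a one-dimensional pSTL template of the form "always x < c" with a single unknown parameter c, replaces Bayesian optimization with a plain grid search over candidate values, and shows only the Lagrangian penalty and multiplier update rather than a full TD3 agent. All function names and the toy data are hypothetical.

```python
# Hypothetical illustration only; names, template, and data are assumptions.

def robustness_always_below(trajectory, c):
    """Robustness of the assumed pSTL template G(x < c).

    It is the minimum over time of (c - x_t); a positive value means the
    whole trajectory stays below the threshold c.
    """
    return min(c - x for x in trajectory)


def label_accuracy(c, labeled_trajectories):
    """Fraction of trajectories whose robustness sign matches the label."""
    correct = 0
    for trajectory, is_safe in labeled_trajectories:
        predicted_safe = robustness_always_below(trajectory, c) > 0
        correct += predicted_safe == is_safe
    return correct / len(labeled_trajectories)


def fit_threshold(labeled_trajectories, candidates):
    """Pick the candidate parameter that best explains the safety labels.

    A grid search stands in for Bayesian optimization in this sketch.
    """
    return max(candidates, key=lambda c: label_accuracy(c, labeled_trajectories))


def lagrangian_reward(reward, rho, lam):
    """Penalized reward: a violation (rho < 0) is weighted by the multiplier."""
    return reward - lam * max(0.0, -rho)


def update_multiplier(lam, mean_violation, step_size=0.1):
    """Dual ascent step on the multiplier, projected to stay non-negative."""
    return max(0.0, lam + step_size * mean_violation)


# Toy labeled dataset: (trajectory of a scalar state, safety label).
data = [
    ([0.1, 0.4, 0.8], True),
    ([0.2, 0.9, 1.4], True),
    ([0.5, 1.8, 2.3], False),
    ([1.0, 2.6, 0.3], False),
]
candidates = [0.5, 1.0, 1.5, 2.0, 2.5]

c_hat = fit_threshold(data, candidates)
print("estimated threshold:", c_hat)
print("penalized reward:", lagrangian_reward(1.0, -0.5, lam=2.0))
```

In a full pipeline the two levels would alternate: the policy is trained against the current parameter estimate using the penalized reward, newly labeled rollouts are added to the dataset, and the parameter estimate is refit. A Gaussian-process-based optimizer would take the place of the grid search when evaluating candidate parameters is expensive.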
