Safe reinforcement learning (RL) trains reward-maximizing agents subject to pre-defined safety constraints. However, learning versatile safe policies that adapt to varying safety constraint requirements at deployment, without retraining, remains largely unexplored and challenging. In this work, we formulate the versatile safe RL problem and consider two primary requirements: training efficiency and zero-shot adaptation capability. To address them, we introduce the Conditioned Constrained Policy Optimization (CCPO) framework, which consists of two key modules: (1) Versatile Value Estimation (VVE), which approximates value functions under unseen threshold conditions, and (2) Conditioned Variational Inference (CVI), which encodes arbitrary constraint thresholds during policy optimization. Our extensive experiments demonstrate that CCPO outperforms the baselines in both safety and task performance while retaining data-efficient zero-shot adaptation to different constraint thresholds. This makes our approach suitable for dynamic real-world applications.
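For orientation, the problem setting can be sketched in constrained-MDP notation. The symbols below (the reward and cost returns J_r and J_c, the threshold ε, the threshold set E, the discount γ, and the threshold-conditioned policy π(· | s, ε)) are assumed standard notation rather than definitions taken from the abstract, so the exact formulation used with CCPO may differ.

```latex
% Standard safe RL (assumed notation): one fixed cost threshold \epsilon
\pi^{*} = \arg\max_{\pi} J_r(\pi)
\quad \text{s.t.} \quad J_c(\pi) \le \epsilon,
\qquad
J_r(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big],
\quad
J_c(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\Big]

% Versatile safe RL (assumed form): a single policy conditioned on \epsilon
% must satisfy the constraint for every threshold in a set \mathcal{E},
% including thresholds not seen during training (zero-shot adaptation)
\pi^{*}(\cdot \mid s, \epsilon)
= \arg\max_{\pi} J_r\big(\pi(\cdot \mid \cdot, \epsilon)\big)
\quad \text{s.t.} \quad
J_c\big(\pi(\cdot \mid \cdot, \epsilon)\big) \le \epsilon,
\qquad \forall\, \epsilon \in \mathcal{E}
```

Read this way, the first line is the usual fixed-threshold problem, and the second asks one policy to solve that problem for every threshold in the set without retraining.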