Safe Offline Reinforcement Learning with Feasibility-Guided Diffusion Model

Safe offline RL is a promising way to bypass risky online interactions when learning safe policies. Most existing methods enforce only soft constraints, i.e., they keep safety violations below predetermined thresholds in expectation. This can lead to potentially unsafe outcomes, which is unacceptable in safety-critical scenarios. An alternative is to enforce the hard constraint of zero violation. However, this is challenging in the offline setting, as it requires striking the right balance among three highly intricate and correlated aspects: safety constraint satisfaction, reward maximization, and the behavior regularization imposed by offline datasets. Interestingly, we discover that, via reachability analysis from safe-control theory, the hard safety constraint can be equivalently translated into identifying the largest feasible region given the offline dataset. This converts the original three-objective problem into a feasibility-dependent objective, i.e., maximizing reward value within the feasible region while minimizing safety risks in the infeasible region. Inspired by this, we propose FISOR (FeasIbility-guided Safe Offline RL), which allows safety constraint adherence, reward maximization, and offline policy learning to be realized via three decoupled processes, while offering strong safety performance and stability. In FISOR, the optimal policy for the translated optimization problem can be derived in a special form of weighted behavior cloning. We therefore propose a novel energy-guided diffusion model that extracts the policy without training a complicated time-dependent classifier, greatly simplifying training. We compare FISOR against baselines on the DSRL benchmark for safe offline RL. Evaluation results show that FISOR is the only method that can guarantee safety satisfaction in all tasks, while achieving top returns in most tasks.
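The feasibility-dependent objective and its weighted behavior cloning form can be illustrated with a minimal sketch. The abstract does not specify the weighting function, so everything below is an assumption made for illustration: the convention that a feasible state has a safety value of at most zero, the exponential form of the weights, the temperatures `alpha_r` and `alpha_h`, the clipping constant `max_weight`, and the function names. A plain squared-error regression stands in for the diffusion training loss, which is not reproduced here.

```python
import numpy as np

def feasibility_weights(reward_adv, safety_value, safety_adv,
                        alpha_r=1.0, alpha_h=1.0, max_weight=100.0):
    """Hypothetical feasibility-dependent weights for weighted behavior cloning.

    Assumed convention: safety_value <= 0 marks a feasible state, and
    safety_value + safety_adv is the safety value of the dataset action.
    """
    reward_adv = np.asarray(reward_adv, dtype=float)
    safety_value = np.asarray(safety_value, dtype=float)
    safety_adv = np.asarray(safety_adv, dtype=float)

    feasible = safety_value <= 0.0
    safe_action = (safety_value + safety_adv) <= 0.0
    log_cap = np.log(max_weight)  # cap exponents so weights stay <= max_weight

    # Feasible region: prefer high-reward actions that keep the state feasible.
    w_feasible = np.where(feasible & safe_action,
                          np.exp(np.minimum(alpha_r * reward_adv, log_cap)), 0.0)
    # Infeasible region: prefer actions that reduce safety risk.
    w_infeasible = np.where(~feasible,
                            np.exp(np.minimum(-alpha_h * safety_adv, log_cap)), 0.0)
    return w_feasible + w_infeasible

def weighted_bc_loss(pred_actions, data_actions, weights):
    """Weighted regression toward dataset actions (stand-in for a diffusion loss)."""
    pred_actions = np.asarray(pred_actions, dtype=float)
    data_actions = np.asarray(data_actions, dtype=float)
    per_sample = np.sum((pred_actions - data_actions) ** 2, axis=-1)
    return float(np.mean(np.asarray(weights, dtype=float) * per_sample))
```

Under these assumptions, a dataset action taken in a feasible state but leading out of the feasible region receives zero weight, so the cloned policy never imitates it, while in infeasible states the weights depend only on the safety signal. Computing the weights from separately learned value functions is one way to read the decoupling of safety, reward, and policy learning described above.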
