Sequential decision making with Markov decision processes (MDPs) underpins many real-world applications. Both model-based and model-free methods have achieved strong results in these settings. However, real-world tasks must balance reward maximization against safety constraints, and these often-conflicting objectives can lead to unstable min-max adversarial optimization. A promising alternative is safety reachability analysis, which precomputes a forward-invariant safe set of state-action pairs, ensuring that an agent starting inside this set remains safe indefinitely. Yet most reachability-based methods address only hard safety constraints, and little work extends reachability to cumulative cost constraints. To address this gap, we first define a safety-conditioned reachability set that decouples reward maximization from cumulative safety cost constraints. Second, we show how this set enforces safety constraints without unstable min-max or Lagrangian optimization, yielding a novel offline safe RL algorithm that learns a safe policy from a fixed dataset without environment interaction. Finally, experiments on standard offline safe RL benchmarks and a real-world maritime navigation task demonstrate that our method matches or outperforms state-of-the-art baselines while maintaining safety.
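The abstract does not give the formal definition of the safety-conditioned reachability set, so the following is only a rough illustration of the general idea of a forward-invariant state-action set under a cumulative cost budget. It is a minimal sketch, not the proposed offline algorithm: it assumes a tiny deterministic tabular MDP with a known model, and the transition table, cost table, budget, and all function names are hypothetical. A pair (s, a) is treated as feasible for a budget when the minimal achievable cumulative cost after taking a in s stays within that budget.

```python
import numpy as np

# Hypothetical 4-state, 2-action deterministic MDP (not from the paper).
# States: 0 = start, 1 = hazard, 2 = detour, 3 = goal (absorbing, zero cost).
NEXT_STATE = np.array([[2, 1],
                       [3, 3],
                       [3, 1],
                       [3, 3]])
COST = np.array([[0.0, 1.0],
                 [2.0, 2.0],
                 [0.0, 1.0],
                 [0.0, 0.0]])


def min_cost_q(next_state, cost, gamma=1.0, iters=100):
    """Value iteration on the safety cost: Q(s,a) = c(s,a) + gamma * min_a' Q(s',a')."""
    q = np.zeros_like(cost)
    for _ in range(iters):
        v = q.min(axis=1)                  # minimal cost-to-go per state
        q = cost + gamma * v[next_state]   # one Bellman backup for every (s, a)
    return q


def feasible_set(q_cost, budget, tol=1e-9):
    """Boolean mask of (s, a) pairs whose minimal cumulative cost fits the budget."""
    return q_cost <= budget + tol


def is_forward_invariant(next_state, cost, q_cost, budget, gamma=1.0, tol=1e-9):
    """Check that every feasible pair leads to a state that is still feasible
    for the remaining budget (budget - c) / gamma."""
    v = q_cost.min(axis=1)
    for s, a in zip(*np.nonzero(feasible_set(q_cost, budget, tol))):
        remaining = (budget - cost[s, a]) / gamma
        if v[next_state[s, a]] > remaining + tol:
            return False
    return True


q_cost = min_cost_q(NEXT_STATE, COST)
mask = feasible_set(q_cost, budget=2.0)
invariant = is_forward_invariant(NEXT_STATE, COST, q_cost, budget=2.0)
```

In this toy setting the mask excludes the actions that step into the hazard from states 0 and 2, because their minimal cumulative cost (3) exceeds the budget of 2. A reward-maximizing policy could then be restricted to the masked actions, which is one way to read the decoupling of reward maximization from the cost constraint that the abstract describes.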