In offline reinforcement learning (RL), we learn policies from fixed datasets without environment interaction. The major challenges are to provide guarantees on the (1) performance and (2) safety of the resulting policy. A technique called safe policy improvement (SPI) provides a performance guarantee: with high probability, the new policy outperforms a given baseline policy, which is assumed to be safe. Orthogonally, in the context of safe RL, a shield provides a safety guarantee by restricting the action space to those actions that are provably safe with respect to a given safety-relevant model. We integrate these paradigms by extending shielding to offline RL, constructing the shield solely from the available dataset and knowledge of which states are safe and unsafe. We then shield the policy improvement steps, so that the resulting policy is safe with high probability. Experimental results demonstrate that shielded SPI outperforms its unshielded counterpart, improving both average and worst-case performance, particularly in low-data regimes.
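To make the idea of shielding a policy improvement step concrete, the following is a minimal tabular sketch, not the method of the paper. It assumes a deliberately simple shield (an action is blocked in a state if the dataset contains a transition from that state-action pair directly into a known unsafe state) and a simple count-based fallback to the baseline policy. The names `build_shield`, `shielded_greedy_step`, `min_count`, and the toy data are all hypothetical.

```python
from collections import Counter


def build_shield(dataset, unsafe_states, states, actions):
    """Return {state: set of allowed actions}, derived from the dataset only.

    Assumption (illustrative): (s, a) is blocked if some observed
    transition (s, a, r, s_next) ends in a known unsafe state.
    """
    blocked = {(s, a) for (s, a, _r, s_next) in dataset if s_next in unsafe_states}
    shield = {}
    for s in states:
        allowed = {a for a in actions if (s, a) not in blocked}
        # If every action is blocked, keep all actions so the policy stays defined.
        shield[s] = allowed or set(actions)
    return shield


def shielded_greedy_step(q_values, baseline, shield, counts, min_count):
    """One greedy improvement step restricted to shielded actions.

    Assumption (illustrative): deviate from the baseline only on
    state-action pairs observed at least `min_count` times.
    """
    policy = {}
    for s, allowed in shield.items():
        candidates = sorted(a for a in allowed if counts[(s, a)] >= min_count)
        if candidates:
            policy[s] = max(candidates, key=lambda a: q_values.get((s, a), 0.0))
        elif baseline[s] in allowed:
            policy[s] = baseline[s]
        else:
            # Baseline action is blocked: pick any allowed action deterministically.
            policy[s] = sorted(allowed)[0]
    return policy


# Toy data: transitions are (state, action, reward, next_state); state 2 is unsafe.
states, actions, unsafe_states = [0, 1, 2], ["left", "right"], {2}
dataset = [
    (0, "right", 1.0, 2), (0, "right", 1.0, 1),
    (0, "left", 0.0, 1), (0, "left", 0.0, 1),
    (1, "right", 0.5, 0), (1, "right", 0.5, 0),
    (1, "left", 0.0, 1),
]
counts = Counter((s, a) for (s, a, _r, _s_next) in dataset)
q_values = {(0, "left"): 0.2, (0, "right"): 1.0, (1, "left"): 0.0, (1, "right"): 0.5}
baseline = {0: "left", 1: "left", 2: "left"}

shield = build_shield(dataset, unsafe_states, states, actions)
policy = shielded_greedy_step(q_values, baseline, shield, counts, min_count=2)
print(policy)
```

In this toy example, state 0 keeps `left` even though `right` has the higher estimated value, because `right` was observed to reach the unsafe state. A real shield would reason about reachability of unsafe states under uncertainty rather than one-step observations, so this only illustrates where the action restriction enters the improvement step.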