This work addresses the problem of offline safe imitation learning (IL), where the goal is to learn safe, reward-maximizing policies from demonstrations that carry no per-timestep safety cost or reward annotations. In many real-world domains, online learning in the environment can be risky, and specifying accurate safety costs is difficult. However, it is often feasible to collect trajectories that reflect undesirable or unsafe behavior, implicitly conveying what the agent should avoid. We refer to these as non-preferred trajectories. We propose a novel offline safe IL algorithm, OSIL, that infers safety from non-preferred demonstrations. We formulate safe policy learning as a Constrained Markov Decision Process (CMDP). Instead of relying on explicit safety cost and reward annotations, OSIL reformulates the CMDP by deriving a lower bound on the reward-maximizing objective and learning a cost model that estimates the likelihood of non-preferred behavior. Our approach allows agents to learn safe and reward-maximizing behavior entirely from offline demonstrations. We empirically demonstrate that our approach learns safer policies that satisfy cost constraints without degrading reward performance, outperforming several baselines.
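For concreteness, the constrained formulation referenced above can be sketched as a standard CMDP objective; the notation below (discount $\gamma$, cost threshold $d$, learned cost $\hat{c}$) is generic CMDP convention rather than the paper's own symbols, with $\hat{c}$ standing in for the learned cost model that scores how likely a state-action pair is to come from non-preferred behavior:
\[
\max_{\pi} \;\; \mathbb{E}_{\tau \sim \pi}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big]
\quad \text{s.t.} \quad
\mathbb{E}_{\tau \sim \pi}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t}\, \hat{c}(s_t, a_t)\Big] \le d .
\]
As described above, OSIL does not assume access to the true reward $r$ or cost at training time: it optimizes a derived lower bound on the reward term and substitutes the learned $\hat{c}$ for the unavailable per-timestep safety cost.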