We propose Convex Constraint Learning for Reinforcement Learning (CoCoRL), a novel approach for inferring shared constraints in a Constrained Markov Decision Process (CMDP) from a set of safe demonstrations with possibly different reward functions. While previous work is limited to demonstrations with known rewards or fully known environment dynamics, CoCoRL can learn constraints from demonstrations with different unknown rewards without knowledge of the environment dynamics. CoCoRL constructs a convex safe set based on demonstrations, which provably guarantees safety even for potentially sub-optimal (but safe) demonstrations. For near-optimal demonstrations, CoCoRL converges to the true safe set with no policy regret. We evaluate CoCoRL in gridworld environments and a driving simulation with multiple constraints. CoCoRL learns constraints that lead to safe driving behavior. Importantly, we can safely transfer the learned constraints to different tasks and environments. In contrast, alternative methods based on Inverse Reinforcement Learning (IRL) often exhibit poor performance and learn unsafe policies.
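The safety guarantee for the convex safe set can be illustrated with a short derivation. This is only a sketch under assumptions that the abstract does not state: each policy $\pi$ is summarized by a discounted feature-expectation vector $\mathbf{f}(\pi)$, each constraint cost is linear in that vector with hypothetical parameters $\phi_j$ and threshold $\xi_j$, and the safe set is taken to be the convex hull of the feature expectations of the demonstrations $\pi_1, \dots, \pi_k$.

```latex
\begin{align*}
&\text{Safe demonstrations: }
  \phi_j^\top \mathbf{f}(\pi_i) \le \xi_j
  \quad \text{for all } i = 1, \dots, k \text{ and all constraints } j. \\
&\text{Any point in the convex hull: }
  \mathbf{f} = \sum_{i=1}^{k} \lambda_i \, \mathbf{f}(\pi_i),
  \quad \lambda_i \ge 0, \quad \sum_{i=1}^{k} \lambda_i = 1. \\
&\text{Then: }
  \phi_j^\top \mathbf{f}
  = \sum_{i=1}^{k} \lambda_i \, \phi_j^\top \mathbf{f}(\pi_i)
  \le \sum_{i=1}^{k} \lambda_i \, \xi_j
  = \xi_j .
\end{align*}
```

Under these assumptions, every point in the hull satisfies all constraints without the learner ever knowing $\phi_j$, $\xi_j$, or the demonstrators' rewards, and the argument uses only the safety of the demonstrations, not their optimality.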