We propose Convex Constraint Learning for Reinforcement Learning (CoCoRL), a novel approach for inferring shared constraints in a Constrained Markov Decision Process (CMDP) from a set of safe demonstrations with possibly different reward functions. While previous work is limited to demonstrations with known rewards or fully known environment dynamics, CoCoRL can learn constraints from demonstrations with different unknown rewards without knowledge of the environment dynamics. CoCoRL constructs a convex safe set based on demonstrations, which provably guarantees safety even for potentially sub-optimal (but safe) demonstrations. For near-optimal demonstrations, CoCoRL converges to the true safe set with no policy regret. We evaluate CoCoRL in tabular environments and a continuous driving simulation with multiple constraints. CoCoRL learns constraints that lead to safe driving behavior and that can be transferred to different tasks and environments. In contrast, alternative methods based on Inverse Reinforcement Learning (IRL) often exhibit poor performance and learn unsafe policies.