Two main challenges in Reinforcement Learning (RL) are designing appropriate reward functions and ensuring the safety of the learned policy. To address these challenges, we present a theoretical framework for Inverse Reinforcement Learning (IRL) in constrained Markov decision processes. From a convex-analytic perspective, we extend prior results on reward identifiability and generalizability to both the constrained setting and a more general class of regularizations. In particular, we show that identifiability up to potential shaping (Cao et al., 2021) is a consequence of entropy regularization and may generally no longer hold for other regularizations or in the presence of safety constraints. We also show that to ensure generalizability to new transition laws and constraints, the true reward must be identified up to a constant. Additionally, we derive a finite sample guarantee for the suboptimality of the learned rewards, and validate our results in a gridworld environment.
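To make the notion of potential shaping concrete, the following is a minimal sketch, not the paper's experiment. It assumes a small random tabular MDP with no safety constraints, a discount factor of 0.9, and entropy regularization implemented as soft value iteration. The helper names (`soft_optimal_policy`, `shape`, `random_mdp`, `max_policy_gap`) are hypothetical and introduced only for illustration. The sketch checks numerically that a reward r(s,a) and its shaped version r(s,a) + γ E[φ(s')] − φ(s) induce the same entropy-regularized optimal policy, which is why an expert's policy alone cannot tell the two rewards apart.

```python
import math
import random


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) for a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def soft_optimal_policy(P, r, gamma, iters=500):
    """Soft (entropy-regularized) value iteration on a tabular MDP.

    P[s][a][t] is the probability of moving from s to t under action a,
    r[s][a] is the reward. Returns pi[s][a] = exp(Q(s, a) - V(s)).
    """
    n_s, n_a = len(r), len(r[0])
    V = [0.0] * n_s
    for _ in range(iters):
        Q = [[r[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
        V = [logsumexp(q) for q in Q]  # V(s) = log sum_a exp Q(s, a)
    return [[math.exp(Q[s][a] - V[s]) for a in range(n_a)] for s in range(n_s)]


def shape(P, r, phi, gamma):
    """Potential shaping: r'(s, a) = r(s, a) + gamma * E[phi(s')] - phi(s)."""
    n_s, n_a = len(r), len(r[0])
    return [[r[s][a] + gamma * sum(P[s][a][t] * phi[t] for t in range(n_s)) - phi[s]
             for a in range(n_a)] for s in range(n_s)]


def random_mdp(n_s, n_a, rng):
    """Hypothetical toy MDP: dense random transitions, Gaussian rewards."""
    P = []
    for _ in range(n_s):
        row = []
        for _ in range(n_a):
            w = [rng.random() + 1e-3 for _ in range(n_s)]
            z = sum(w)
            row.append([x / z for x in w])
        P.append(row)
    r = [[rng.gauss(0, 1) for _ in range(n_a)] for _ in range(n_s)]
    return P, r


def max_policy_gap(pi1, pi2):
    """Largest absolute difference between two tabular policies."""
    return max(abs(a - b) for p, q in zip(pi1, pi2) for a, b in zip(p, q))


rng = random.Random(0)
gamma = 0.9
P, r = random_mdp(4, 3, rng)
phi = [rng.gauss(0, 1) for _ in range(4)]  # arbitrary potential function

pi = soft_optimal_policy(P, r, gamma)
pi_shaped = soft_optimal_policy(P, shape(P, r, phi, gamma), gamma)
print("max policy gap under shaping:", max_policy_gap(pi, pi_shaped))
```

The printed gap should be at the level of floating-point error, since shaping shifts the soft value function by −φ(s) and leaves Q(s,a) − V(s) unchanged. The sketch illustrates only this invariance direction in the unconstrained, entropy-regularized case; it does not reproduce the constrained setting or the other regularizers discussed above.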