When solving real-world problems, humans implicitly adhere to constraints that are too numerous and complex to specify completely. Reinforcement learning (RL) agents, however, need these constraints in order to learn the correct optimal policy in such settings. The field of Inverse Constraint Reinforcement Learning (ICRL) addresses this problem with algorithms that estimate the constraints from expert demonstrations collected offline. Practitioners want a measure of confidence in the estimated constraints before deciding to use them, so that they can adopt only those constraints that meet a desired level of confidence. Prior works, however, do not let users specify a desired confidence level for the inferred constraints. This work provides a principled ICRL method that takes a confidence level together with a set of expert demonstrations and outputs a constraint that, with the desired level of confidence, is at least as constraining as the true underlying constraint. Further, unlike previous methods, this method lets a user determine whether the number of expert trajectories is insufficient to learn a constraint at the desired level of confidence, and therefore collect more expert trajectories as needed to simultaneously learn constraints with the desired level of confidence and a policy that achieves the desired level of performance.