Constrained RL is a popular framework for enforcing safe actions in Reinforcement Learning (RL): trajectory-based constraints on expected cost (or other cost measures) enforce safety while the expected reward is maximized. Most recent approaches to Constrained RL convert the trajectory-based cost constraint into a surrogate problem that can be solved with minor modifications to standard RL methods. A key drawback of such approaches is that they over- or underestimate the cost constraint at each state. We therefore propose an approach that leaves the trajectory-based cost constraint unmodified and instead imitates ``good'' trajectories and avoids ``bad'' trajectories generated by incrementally improving policies. An oracle labels each trajectory as ``good'' or ``bad'' using a reward threshold (which is varied during learning) and the overall cost constraint. A key advantage of our approach is that it can start from any initial policy or set of trajectories and improve on it. In an exhaustive set of experiments, we demonstrate that our approach outperforms top benchmark approaches for solving Constrained RL problems with respect to expected cost, CVaR cost, or even unknown cost constraints.
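The labeling step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the oracle's exact rule, so the code below assumes one plausible reading: a trajectory is ``good'' when its total reward meets the current reward threshold and its total cost stays within the cost limit, and ``bad'' otherwise. Applying the cost limit per trajectory is a simplification of a constraint on expected cost. The names `Trajectory`, `label_trajectories`, `reward_threshold`, and `cost_limit` are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trajectory:
    rewards: List[float]  # per-step rewards collected along the trajectory
    costs: List[float]    # per-step costs incurred along the trajectory

    @property
    def total_reward(self) -> float:
        return sum(self.rewards)

    @property
    def total_cost(self) -> float:
        return sum(self.costs)


def label_trajectories(
    trajectories: List[Trajectory],
    reward_threshold: float,
    cost_limit: float,
) -> Tuple[List[Trajectory], List[Trajectory]]:
    """Split trajectories into (good, bad) lists.

    Assumed rule (not confirmed by the source): a trajectory is good if its
    total reward is at least reward_threshold AND its total cost is at most
    cost_limit. Every other trajectory is labeled bad.
    """
    good: List[Trajectory] = []
    bad: List[Trajectory] = []
    for traj in trajectories:
        meets_reward = traj.total_reward >= reward_threshold
        meets_cost = traj.total_cost <= cost_limit
        if meets_reward and meets_cost:
            good.append(traj)
        else:
            bad.append(traj)
    return good, bad
```

Under this reading, a training loop would call `label_trajectories` on trajectories sampled from the current policy, imitate the good set while steering away from the bad set, and adjust `reward_threshold` between iterations. How the threshold is scheduled is not specified in the abstract and is left open here.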