Constrained reinforcement learning (RL) has been used to enforce safety constraints on a policy through expected cost constraints. The key challenge is that the constraint applies to the expected cost accumulated over the whole trajectory under the policy, not to the cost of a single step. Existing methods have developed innovative ways of converting this policy-level cost constraint into constraints over local decisions (at each time step). While such approaches yield good solutions with regard to the objective, they can be either overly aggressive or overly conservative with respect to costs, because the local cost constraints rely on estimates of "future" or "backward" costs. To address this, we provide an unconstrained formulation, equivalent to constrained RL, that uses an augmented state space and reward penalties. This intuitive formulation is general and has interesting theoretical properties. More importantly, it provides a new paradigm for solving constrained RL problems effectively. In our experiments, we outperform leading approaches on multiple benchmark problems from the literature.
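To make the idea of an augmented state space with reward penalties concrete, the following is a minimal illustrative sketch, not the formulation from the paper. It assumes one simple way such an augmentation could look: the state is extended with the cost accumulated so far, and a fixed penalty is subtracted from the reward on every step where that accumulated cost exceeds a budget. The names `CostAugmentedEnv`, `toy_step`, `cost_budget`, and `penalty`, as well as the per-step penalty rule, are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical transition signature: (state, action) -> (next_state, reward, cost, done)
StepFn = Callable[[int, int], Tuple[int, float, float, bool]]


@dataclass
class CostAugmentedEnv:
    """Illustrative wrapper: augments the state with accumulated cost
    and penalizes the reward while the cost budget is exceeded.
    This is an assumed construction, not the paper's exact one."""
    step_fn: StepFn
    initial_state: int
    cost_budget: float
    penalty: float

    def reset(self) -> Tuple[int, float]:
        self._state = self.initial_state
        self._acc_cost = 0.0
        # Augmented state: (original state, accumulated cost so far)
        return (self._state, self._acc_cost)

    def step(self, action: int) -> Tuple[Tuple[int, float], float, bool]:
        next_state, reward, cost, done = self.step_fn(self._state, action)
        self._state = next_state
        self._acc_cost += cost
        # Assumed penalty rule: subtract a fixed penalty on each step
        # where the accumulated cost is above the budget.
        if self._acc_cost > self.cost_budget:
            reward -= self.penalty
        return (self._state, self._acc_cost), reward, done


def toy_step(state: int, action: int) -> Tuple[int, float, float, bool]:
    """Hypothetical 3-step chain: action 1 is risky (reward 2, cost 1),
    any other action is safe (reward 1, cost 0)."""
    reward, cost = (2.0, 1.0) if action == 1 else (1.0, 0.0)
    next_state = state + 1
    return next_state, reward, cost, next_state >= 3


env = CostAugmentedEnv(step_fn=toy_step, initial_state=0,
                       cost_budget=1.0, penalty=10.0)
augmented_state = env.reset()
```

Because the accumulated cost is part of the state, a standard unconstrained RL algorithm run on such a wrapped environment can condition its decisions on how much of the budget has already been spent, rather than relying on a separate estimate of past or future cost.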