A common formulation of constrained reinforcement learning involves multiple rewards that must individually accumulate to given thresholds. For this class of problems, we present a simple example in which the desired optimal policy cannot be induced by any weighted linear combination of the rewards. Hence, there exist constrained reinforcement learning problems for which neither regularized nor classical primal-dual methods yield optimal policies. This work addresses this shortcoming by augmenting the state with Lagrange multipliers and reinterpreting primal-dual methods as the portion of the dynamics that drives the multipliers' evolution. This approach provides a systematic state augmentation procedure that is guaranteed to solve reinforcement learning problems with constraints. Thus, as we illustrate with an example, while previous methods can fail to find optimal policies, running the dual dynamics while executing the augmented policy yields an algorithm that provably samples actions from the optimal policy.
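To make the contrast concrete, the following is a minimal sketch on a hypothetical toy problem; it is not the example from the paper. It assumes a single state, two actions that each earn exactly one of two rewards, and thresholds of 0.4 on the average of each reward. The names `REWARDS`, `THRESHOLDS`, `greedy_action`, `run_fixed_weights`, and `run_dual_dynamics` are illustrative, and the per-step projected dual update with step size 0.1 is an assumed simplification. Because the state is trivial here, the augmented policy reduces to a function of the multipliers alone.

```python
import numpy as np

# Hypothetical toy problem (an assumption, not the paper's example):
# one state, two actions, two rewards.
REWARDS = np.array([[1.0, 0.0],    # action 0 earns only reward 0
                    [0.0, 1.0]])   # action 1 earns only reward 1
THRESHOLDS = np.array([0.4, 0.4])  # each reward must average at least 0.4


def greedy_action(multipliers):
    """Deterministic maximizer of the multiplier-weighted reward.

    Ties are broken in favor of action 0 (np.argmax returns the first maximum).
    """
    return int(np.argmax(REWARDS @ multipliers))


def run_fixed_weights(multipliers):
    """Average reward vector when the weights are frozen.

    The greedy policy then repeats one action forever, so the long-run
    average equals that action's reward vector.
    """
    return REWARDS[greedy_action(multipliers)].copy()


def run_dual_dynamics(steps=1000, step_size=0.1):
    """Execute the multiplier-dependent policy while updating the multipliers.

    The multipliers act as extra state: the action depends on their current
    value, and a projected dual step moves them after every action.
    """
    multipliers = np.zeros(2)
    total = np.zeros(2)
    for _ in range(steps):
        reward = REWARDS[greedy_action(multipliers)]
        total += reward
        # Raise the multiplier of a reward that fell short of its threshold,
        # lower the other one, and project back onto the nonnegative orthant.
        multipliers = np.maximum(0.0, multipliers - step_size * (reward - THRESHOLDS))
    return total / steps


if __name__ == "__main__":
    print("fixed weights (0.3, 0.7):", run_fixed_weights(np.array([0.3, 0.7])))
    print("dual dynamics averages:  ", run_dual_dynamics())
```

In this sketch every fixed weighting commits the greedy policy to a single action, so one of the two constraints is always violated, whereas the multiplier updates make the policy switch between actions so that both running averages stay above their thresholds. The deterministic tie-break is a design choice of the sketch; it keeps the fixed-weight baseline from satisfying the constraints by accident when the weights are equal.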