In recent years, Reinforcement Learning (RL) has been applied to real-world problems with increasing success. Such applications often require placing constraints on the agent's behavior. Existing algorithms for constrained RL (CRL) rely on gradient descent-ascent, but this approach comes with a caveat. While these algorithms are guaranteed to converge on average, they do not guarantee last-iterate convergence, i.e., the current policy of the agent may never converge to the optimal solution. In practice, the policy is often observed to alternate between satisfying the constraints and maximizing the reward, rarely accomplishing both objectives simultaneously. Here, we address this problem by introducing Reinforcement Learning with Optimistic Ascent-Descent (ReLOAD), a principled CRL method with guaranteed last-iterate convergence. We demonstrate its empirical effectiveness on a wide variety of CRL problems, including discrete MDPs and continuous control. In the process, we establish a benchmark of challenging CRL problems.
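The gap between plain and optimistic gradient descent-ascent can be seen on a toy saddle-point problem. The sketch below is not ReLOAD and not a CRL task: the bilinear objective f(x, y) = x * y, the step size, the starting point, and the function names `gda` and `optimistic_gda` are all assumptions chosen for illustration. It only shows that, on this objective, the last iterate of simultaneous descent-ascent spirals away from the saddle point at (0, 0), while an optimistic update that extrapolates with the previous gradient approaches it.

```python
import math


def gda(x, y, eta, steps):
    """Simultaneous gradient descent-ascent on the toy objective f(x, y) = x * y.

    x is the minimizing player, y the maximizing player.
    """
    for _ in range(steps):
        gx, gy = y, x  # df/dx = y, df/dy = x
        x, y = x - eta * gx, y + eta * gy
    return x, y


def optimistic_gda(x, y, eta, steps):
    """Optimistic variant: each step uses 2 * current gradient - previous gradient."""
    gx_prev, gy_prev = y, x  # assumption: the first step reduces to a plain GDA step
    for _ in range(steps):
        gx, gy = y, x
        x, y = x - eta * (2 * gx - gx_prev), y + eta * (2 * gy - gy_prev)
        gx_prev, gy_prev = gx, gy
    return x, y


# Distance of the last iterate from the saddle point (0, 0).
start_dist = math.hypot(1.0, 1.0)
gda_dist = math.hypot(*gda(1.0, 1.0, eta=0.1, steps=2000))
ogda_dist = math.hypot(*optimistic_gda(1.0, 1.0, eta=0.1, steps=2000))

print(f"start:          {start_dist:.3e}")
print(f"GDA last iter:  {gda_dist:.3e}")
print(f"OGDA last iter: {ogda_dist:.3e}")
```

In constrained RL the two players would correspond to the policy and the Lagrange multipliers of the constraints, so the oscillation in this toy problem is a loose analogue of the alternation between constraint satisfaction and reward maximization described above.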