Existing approaches to constrained reinforcement learning (RL) can obtain a well-performing policy in the training environment. However, when deployed in a real environment, such a policy may easily violate constraints that it satisfied during training, because the model of the training environment may not match that of the real environment. To address this challenge, we formulate the problem as constrained RL under model uncertainty, where the goal is to learn a policy that optimizes the reward while satisfying the constraint under model mismatch. We develop a Robust Constrained Policy Optimization (RCPO) algorithm, which is the first algorithm that applies to large/continuous state spaces and has theoretical guarantees on worst-case reward improvement and constraint violation at each iteration during training. We demonstrate the effectiveness of our algorithm on a set of RL tasks with constraints.
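As a rough illustration of what "constrained RL under model uncertainty" can look like, the sketch below writes the objective in one common worst-case form. The notation is assumed for illustration and is not taken from the abstract: $\mathcal{P}$ is an uncertainty set of transition kernels containing the training model, $\rho$ is the initial state distribution, $V_r^{\pi,P}$ and $V_c^{\pi,P}$ are the expected discounted reward and constraint cost of policy $\pi$ under kernel $P$, and $b$ is the constraint threshold. The paper's exact formulation may differ, for example in the direction of the constraint.

```latex
% Hypothetical notation: \mathcal{P}, \rho, V_r, V_c and b are assumptions.
\begin{equation*}
  \max_{\pi} \; \min_{P \in \mathcal{P}} V_r^{\pi,P}(\rho)
  \quad \text{subject to} \quad
  \max_{P \in \mathcal{P}} V_c^{\pi,P}(\rho) \le b .
\end{equation*}
```

Under this reading, a policy is feasible only if it satisfies the constraint for every model in $\mathcal{P}$, so a constraint met in the training environment is also met in any real environment whose dynamics lie in the set.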