Existing studies on constrained reinforcement learning (RL) can obtain a well-performing policy in the training environment. However, when deployed in a real environment, such a policy may easily violate constraints that were satisfied during training, because of model mismatch between the training and real environments. To address this challenge, we formulate the problem as constrained RL under model uncertainty, where the goal is to learn a policy that optimizes the reward while satisfying the constraint under model mismatch. We develop a Robust Constrained Policy Optimization (RCPO) algorithm, which is the first algorithm that applies to large/continuous state spaces and has theoretical guarantees on worst-case reward improvement and constraint violation at each iteration during training. We demonstrate the effectiveness of our algorithm on a set of RL tasks with constraints.
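For concreteness, one common way to write such a robust constrained objective (our illustrative notation; the abstract does not fix the exact formulation) is

$$
\max_{\pi} \ \min_{P \in \mathcal{P}} V_r^{\pi, P}
\quad \text{s.t.} \quad \max_{P \in \mathcal{P}} V_c^{\pi, P} \le b,
$$

where $V_r^{\pi,P}$ and $V_c^{\pi,P}$ denote the expected cumulative reward and cost of policy $\pi$ under transition model $P$, $\mathcal{P}$ is an uncertainty set of models around the training environment capturing the possible mismatch, and $b$ is the constraint threshold. Under this reading, the reward is optimized and the constraint enforced against the worst-case model in $\mathcal{P}$.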