Constrained decision-making is essential for designing safe policies in real-world control systems, yet simulated environments often fail to capture real-world adversities. We consider the problem of learning a policy that maximizes the cumulative reward while satisfying a constraint, even when there is a mismatch between the real model and an accessible simulator/nominal model. In particular, we consider the robust constrained Markov decision problem (RCMDP), where an agent must maximize the reward and satisfy the constraint against the worst possible stochastic model in an uncertainty set centered around an unknown nominal model. Primal-dual methods, effective for the standard constrained MDP (CMDP), are not applicable here because the strong duality property fails to hold. Moreover, one cannot apply the standard robust value-iteration approach to the composite value function either, as the worst-case models may differ between the reward value function and the constraint value function. We propose a novel technique that effectively minimizes the constraint value function to satisfy the constraints; once all the constraints are satisfied, it simply maximizes the robust reward value function. We prove that such an algorithm finds a feasible policy that is at most $ε$ sub-optimal after $O(ε^{-2})$ iterations. In contrast to the state-of-the-art method, we do not need to employ a binary search; thus, we reduce the computation time by at least 4x for smaller values of the discount factor ($γ$) and by at least 6x for larger values of $γ$.
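The switching rule at the heart of the proposed technique can be illustrated with a minimal sketch. The example below is a hypothetical one-parameter toy problem, not a real RCMDP: `reward_value` and `cost_value` are simple analytic surrogates standing in for the robust reward and constraint value functions, and the gradient steps stand in for the policy updates. At each iteration the algorithm checks feasibility; if the constraint is violated it descends the constraint value function, otherwise it ascends the reward value function.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reward_value(theta):
    # Surrogate for the robust reward value function:
    # a higher probability of the "risky" action yields more reward.
    return 2.0 * sigmoid(theta)

def cost_value(theta):
    # Surrogate for the constraint value function:
    # the same action also incurs cost; feasibility means cost <= budget.
    return 3.0 * sigmoid(theta)

def grad(f, theta, eps=1e-5):
    # Central finite-difference gradient (placeholder for a policy gradient).
    return (f(theta + eps) - f(theta - eps)) / (2.0 * eps)

budget = 1.5   # constraint threshold (illustrative)
theta, lr = 2.0, 0.5
for _ in range(200):
    if cost_value(theta) > budget:
        # Infeasible: minimize the constraint value function.
        theta -= lr * grad(cost_value, theta)
    else:
        # Feasible: maximize the reward value function.
        theta += lr * grad(reward_value, theta)

# The iterate settles near the constraint boundary, where the reward
# is as large as feasibility allows.
```

In this toy instance the two objectives conflict, so the iterate oscillates tightly around the boundary `cost_value(theta) = budget`, mirroring how the full method trades constraint satisfaction against robust reward maximization without a Lagrangian dual variable or a binary search.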