Robust Markov decision processes (RMDPs) are used for dynamic optimization in uncertain environments and have been studied extensively. Many of the main properties and algorithms of Markov decision processes (MDPs), such as value iteration and policy iteration, extend directly to RMDPs. Surprisingly, there is no known analog of the MDP convex optimization formulation for solving RMDPs. This work describes the first convex optimization formulation of RMDPs under the classical sa-rectangularity and s-rectangularity assumptions. By using entropic regularization and an exponential change of variables, we derive a convex formulation whose numbers of variables and constraints are polynomial in the number of states and actions, but whose constraints contain large coefficients. We further simplify the formulation for RMDPs with polyhedral, ellipsoidal, or entropy-based uncertainty sets, showing that, in these cases, RMDPs can be reformulated as conic programs based on exponential cones, quadratic cones, and non-negative orthants. Our work opens a new research direction for RMDPs and can serve as a first step toward obtaining a tractable convex formulation of RMDPs.
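For orientation, the "MDP convex optimization formulation" mentioned above is usually taken to be the classical linear program for discounted MDPs. The sketch below states that program and the naive robust analog under sa-rectangularity, to illustrate why the extension is not direct. All notation here is an assumption introduced for illustration and does not come from the abstract: states $s \in \mathcal{S}$, actions $a \in \mathcal{A}$, rewards $r(s,a)$, discount factor $\gamma \in [0,1)$, initial distribution $p_0$, nominal transition probabilities $P(s' \mid s,a)$, and uncertainty sets $\mathcal{P}_{s,a}$.

```latex
% Classical LP for a discounted MDP (reward maximization); notation is assumed.
\begin{align*}
\min_{v \in \mathbb{R}^{|\mathcal{S}|}} \quad & \sum_{s \in \mathcal{S}} p_0(s)\, v(s) \\
\text{s.t.} \quad & v(s) \;\ge\; r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s,a)\, v(s')
  \qquad \forall s \in \mathcal{S},\; a \in \mathcal{A}.
\end{align*}

% Naive robust analog under sa-rectangularity: the transition vector for each
% (s,a) is chosen adversarially from the uncertainty set \mathcal{P}_{s,a}.
\begin{align*}
v(s) \;\ge\; \min_{p \in \mathcal{P}_{s,a}} \Bigl( r(s,a) + \gamma\, p^{\top} v \Bigr)
  \qquad \forall s \in \mathcal{S},\; a \in \mathcal{A}.
\end{align*}
% The right-hand side is a pointwise minimum of functions linear in v, hence
% concave in v, so this constraint does not in general define a convex set.
```

In the nominal program every constraint is linear in $v$. In the robust analog the inner minimization makes the right-hand side concave in $v$, so the feasible region is in general non-convex, which is the gap the formulation described above is meant to close.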