A basic model in sequential decision making is the Markov decision process (MDP). Robust MDPs (RMDPs) extend it by allowing uncertainty in the transition probabilities and optimizing against the worst-case transition probabilities from the uncertainty sets. The class of $(s, a)$-rectangular RMDPs with $L_p$ uncertainty sets provides a flexible and expressive model for such problems. We study this class of RMDPs with a discounted-sum cost criterion and a constant discount factor. Whether an efficient algorithm exists for this class is a fundamental theoretical question in optimization and sequential decision making. Previous results establish a strongly polynomial-time algorithm only for $L_\infty$ uncertainty sets. In this work, our main results are as follows: (a)~we show that for any compact uncertainty set, the policy iteration algorithm for RMDPs is strongly polynomial given oracle access to solutions of robust Markov chains (RMCs); (b)~we present strongly polynomial-time bounds on the policy iteration algorithm for RMCs with $L_1$ and $L_\infty$ uncertainty sets; and (c)~we establish hardness results for RMCs with $L_p$ uncertainty sets for integer $p$ satisfying $1<p<\infty$. Finally, motivated by our theoretical bounds, we present experimental results showing how fast policy iteration converges for RMDPs with $L_1$ and $L_\infty$ uncertainty sets.
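To make the worst-case optimization concrete, the following is a minimal sketch of how a robust Markov chain with an $L_1$ uncertainty set might be evaluated. It is not the paper's algorithm: it uses plain value iteration rather than the policy iteration analyzed in this work, and it assumes that each state's uncertainty set is the $L_1$ ball of radius `kappa` around a nominal distribution and that the adversary maximizes the discounted cost. The names `worst_case_l1`, `robust_chain_value`, `P_nominal`, `cost`, `gamma`, and `kappa` are hypothetical and chosen only for illustration.

```python
import numpy as np


def worst_case_l1(p_nominal, v, kappa):
    """Return a distribution p maximizing p @ v with ||p - p_nominal||_1 <= kappa.

    Assumption: the adversary maximizes cost, so mass is shifted toward
    the state with the largest value and taken from the smallest ones.
    """
    p = np.asarray(p_nominal, dtype=float).copy()
    v = np.asarray(v, dtype=float)
    top = int(np.argmax(v))
    # Moving mass m changes the L1 distance by 2m, hence the kappa / 2 cap.
    budget = min(kappa / 2.0, 1.0 - p[top])
    p[top] += budget
    # Remove the same amount of mass from the lowest-value states first.
    for i in np.argsort(v):
        if budget <= 0:
            break
        if i == top:
            continue
        taken = min(p[i], budget)
        p[i] -= taken
        budget -= taken
    return p


def robust_chain_value(P_nominal, cost, gamma, kappa, tol=1e-10, max_iter=10_000):
    """Iterate v(s) = cost(s) + gamma * max_p p @ v until the update is below tol."""
    P_nominal = np.asarray(P_nominal, dtype=float)
    cost = np.asarray(cost, dtype=float)
    v = np.zeros(len(cost))
    for _ in range(max_iter):
        v_new = np.array([
            cost[s] + gamma * (worst_case_l1(P_nominal[s], v, kappa) @ v)
            for s in range(len(cost))
        ])
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v
```

With `kappa = 0` the uncertainty set collapses to the nominal distribution, so the result should agree with the ordinary discounted value of the nominal chain. Increasing `kappa` can only raise the worst-case cost.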