Robust Markov Decision Processes (RMDPs) are a widely used framework for sequential decision-making under parameter uncertainty. RMDPs have been extensively studied when the objective is to maximize the discounted return, but little is known about average optimality (optimizing the long-run average of the rewards obtained over time) and Blackwell optimality (remaining discount optimal for all discount factors sufficiently close to 1). In this paper, we prove several foundational results for RMDPs beyond the discounted return. We show that average optimal policies can be chosen stationary and deterministic for sa-rectangular RMDPs but, perhaps surprisingly, that for s-rectangular RMDPs average optimal policies may not exist and, if they exist, may need to be history-dependent (Markovian). We also study Blackwell optimality for sa-rectangular RMDPs, where we show that $\epsilon$-Blackwell optimal policies always exist, although Blackwell optimal policies may not; we provide a sufficient condition for their existence that encompasses virtually all examples from the literature. We then discuss the connection between average and Blackwell optimality, and we describe several algorithms to compute the optimal average return. Interestingly, our approach leverages the connections between RMDPs and stochastic games. Overall, our paper emphasizes the superior practical properties of distance-based sa-rectangular models over s-rectangular models for average and Blackwell optimality.
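For concreteness, the two optimality criteria above can be sketched as follows; the notation ($\mathcal{P}$ for the uncertainty set, $v^{\pi}_{\gamma}$ and $g^{\pi}$ for the robust discounted and average returns) is illustrative, and the precise order of the infimum and the limit follows the conventions of the paper:
\[
v^{\pi}_{\gamma} \;=\; \inf_{p \in \mathcal{P}} \, \mathbb{E}^{\pi,p}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big],
\qquad
g^{\pi} \;=\; \inf_{p \in \mathcal{P}} \, \liminf_{T \to \infty} \frac{1}{T}\, \mathbb{E}^{\pi,p}\Big[\sum_{t=0}^{T-1} r(s_t, a_t)\Big].
\]
A policy is average optimal if it maximizes $g^{\pi}$, and Blackwell optimal if there exists a threshold $\bar{\gamma} \in (0,1)$ such that it maximizes $v^{\pi}_{\gamma}$ for every discount factor $\gamma \in (\bar{\gamma}, 1)$; $\epsilon$-Blackwell optimality relaxes the latter requirement to optimality up to $\epsilon$.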