Robust Markov Decision Processes (RMDPs) are a widely used framework for sequential decision-making under parameter uncertainty. RMDPs have been extensively studied when the objective is to maximize the discounted return, but little is known about average optimality (optimizing the long-run average of the rewards obtained over time) and Blackwell optimality (remaining discount optimal for all discount factors sufficiently close to 1). In this paper, we prove several foundational results for RMDPs beyond the discounted return. We show that average optimal policies can be chosen to be stationary and deterministic for sa-rectangular RMDPs but, perhaps surprisingly, that history-dependent (Markovian) policies strictly outperform stationary policies for average optimality in s-rectangular RMDPs. We also study Blackwell optimality for sa-rectangular RMDPs, where we show that approximate Blackwell optimal policies always exist, although Blackwell optimal policies may not exist. We also provide a sufficient condition for their existence, which encompasses virtually all examples from the literature. We then discuss the connection between average and Blackwell optimality, and we describe several algorithms to compute the optimal average return. Interestingly, our approach leverages the connections between RMDPs and stochastic games.