In this paper, we propose a general theory of ambiguity-averse MDPs, which treats the uncertain transition probabilities as random variables and evaluates a policy via a risk measure applied to its random return. This framework unifies several existing models of MDPs with epistemic uncertainty, each recovered by a specific choice of risk measure. We extend the concepts of value functions and Bellman operators to our setting. Building on these objects, we establish the standard consequences of the dynamic programming principle in this framework (existence of optimal stationary policies, value and policy iteration algorithms), and we completely characterize the law-invariant risk measures compatible with dynamic programming. Our work draws connections among several variants of MDP models and fully delineates what is possible under the dynamic programming paradigm and which risk measures require leaving it.
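To make the abstract's objects concrete, the following is a minimal illustrative sketch (not the paper's construction) of an ambiguity-averse Bellman backup: epistemic uncertainty is represented by `K` sampled transition kernels, and a law-invariant risk measure, here CVaR as one example choice, is applied across the sampled one-step backups before maximizing over actions. All names (`cvar`, `ambiguity_averse_value_iteration`) and the sampled-kernel representation are assumptions made for illustration.

```python
import numpy as np

def cvar(values, alpha):
    """CVaR at level alpha over the lower tail: mean of the worst
    ceil(alpha * K) outcomes. alpha = 1.0 recovers the plain expectation."""
    k = max(1, int(np.ceil(alpha * len(values))))
    return np.mean(np.sort(values)[:k])

def ambiguity_averse_value_iteration(P_samples, R, gamma=0.9, alpha=0.2,
                                     tol=1e-8, max_iter=1000):
    """Value iteration with a risk measure applied across sampled models.

    P_samples: (K, S, A, S) array of K sampled transition kernels
               (the random transition probabilities of the framework).
    R:         (S, A) reward array.
    Returns the fixed point V of the risk-averse Bellman operator.
    """
    K, S, A, _ = P_samples.shape
    V = np.zeros(S)
    for _ in range(max_iter):
        # Per-model backup: Q_k(s, a) = R(s, a) + gamma * sum_{s'} P_k(s'|s,a) V(s')
        Q = R[None] + gamma * np.einsum('ksat,t->ksa', P_samples, V)
        # Apply the risk measure across the K sampled models, then act greedily.
        Q_risk = np.apply_along_axis(cvar, 0, Q, alpha)  # shape (S, A)
        V_new = Q_risk.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V
```

Because CVaR of the lower tail never exceeds the mean, the fixed point under `alpha < 1` is pointwise dominated by the ambiguity-neutral (expectation) fixed point obtained with `alpha = 1.0`; the paper's characterization addresses which risk measures admit such a contractive recursion at all.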