The parameters of a Markov Decision Process (MDP) often cannot be specified exactly. Uncertain MDPs (UMDPs) capture this model ambiguity by defining sets to which the parameters belong. Minimax regret has been proposed as an objective for planning in UMDPs to find robust policies that are not overly conservative. In this work, we focus on planning for Stochastic Shortest Path (SSP) UMDPs with uncertain cost and transition functions. We introduce a Bellman equation to compute the regret of a policy. We propose a dynamic programming algorithm that utilises the regret Bellman equation, and show that it optimises minimax regret exactly for UMDPs with independent uncertainties. For coupled uncertainties, we extend our approach to use options, enabling a trade-off between computation and solution quality. We evaluate our approach on both synthetic and real-world domains, showing that it significantly outperforms existing baselines.
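To make the minimax regret objective concrete, the following is a minimal sketch that computes it by brute force for a toy SSP whose uncertainty set contains only two candidate models. It is not the dynamic programming algorithm proposed in this work: it simply enumerates every deterministic policy and every model. The dictionary layout, the function names (`evaluate`, `all_policies`, `minimax_regret`) and the toy costs are all assumptions made for illustration, and every policy is assumed to reach the goal with probability 1.

```python
from itertools import product

GOAL = "g"  # absorbing, zero-cost goal state (hypothetical name)


def evaluate(policy, model, iters=1000):
    """Expected cost-to-goal of a fixed policy under one model.

    Uses plain iterative policy evaluation; assumes the policy is proper,
    i.e. it reaches GOAL with probability 1.
    """
    values = {s: 0.0 for s in model["cost"]}
    values[GOAL] = 0.0
    for _ in range(iters):
        for s in model["cost"]:
            a = policy[s]
            values[s] = model["cost"][s][a] + sum(
                p * values[t] for t, p in model["trans"][s][a]
            )
    return values


def all_policies(model):
    """Enumerate every deterministic policy (only feasible for toy problems)."""
    states = list(model["cost"])
    action_sets = [list(model["cost"][s]) for s in states]
    for choice in product(*action_sets):
        yield dict(zip(states, choice))


def minimax_regret(models, start):
    """Return the policy minimising the maximum regret over a finite model set."""
    policies = list(all_policies(models[0]))
    # cost[i][j]: expected cost of policy i from the start state under model j.
    cost = [[evaluate(pi, m)[start] for m in models] for pi in policies]
    # Best achievable cost under each model, taken over all policies.
    best = [min(row[j] for row in cost) for j in range(len(models))]
    # Regret = cost of the policy minus the best cost, maximised over models.
    max_regret = [
        max(row[j] - best[j] for j in range(len(models))) for row in cost
    ]
    i_star = min(range(len(policies)), key=max_regret.__getitem__)
    return policies[i_star], max_regret[i_star]


# Toy instance: one state, two actions that both reach the goal in one step.
# Only the costs differ between the two candidate models.
trans = {"s0": {"a": [(GOAL, 1.0)], "b": [(GOAL, 1.0)]}}
m1 = {"trans": trans, "cost": {"s0": {"a": 1.0, "b": 3.0}}}
m2 = {"trans": trans, "cost": {"s0": {"a": 4.0, "b": 3.0}}}

policy, regret = minimax_regret([m1, m2], "s0")
print(policy, regret)
```

In this toy instance, action `a` has a maximum regret of 1 (it costs 4 under `m2`, where the best cost is 3), while action `b` has a maximum regret of 2 (it costs 3 under `m1`, where the best cost is 1), so the sketch selects `a`. A worst-case cost criterion would instead select `b` (worst-case cost 3 versus 4), which illustrates how minimax regret can be less conservative.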